Understanding Text Embeddings: From Words to Meaning

Author: Jared Chung

Introduction

Text embeddings transform words, sentences, or documents into numerical vectors that capture semantic meaning. They're the foundation of modern NLP—nearly every application from search to chatbots relies on them.

The key insight: similar concepts have similar vectors. This enables computers to understand that "happy" and "joyful" are related, even though they share no characters.

This guide traces the evolution of embeddings and helps you choose the right approach for your use case.

The Evolution of Text Representations

[Figure: Evolution of Text Embeddings]

Why Traditional Methods Fall Short

Before embeddings, text was represented using sparse methods:

One-Hot Encoding:

Vocabulary: [cat, dog, bird]
cat  = [1, 0, 0]
dog  = [0, 1, 0]
bird = [0, 0, 1]

Problems:

  • Vocabulary of 100,000 words = 100,000-dimensional vectors
  • No similarity: cosine(cat, dog) = 0, same as cosine(cat, quantum)
  • Cannot handle words not in the vocabulary

TF-IDF:

  • Weights words by importance (term frequency × inverse document frequency)
  • Better than one-hot, but still high-dimensional and sparse
  • Treats "happy" and "joyful" as completely unrelated (the sketch below makes this concrete)
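
To see the sparsity problem in practice, here is a small sketch using scikit-learn's TfidfVectorizer on a few made-up documents (the documents and variable names are illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents chosen to show the synonym problem
docs = [
    "the happy child smiled",
    "the joyful child smiled",
    "quantum chromodynamics lecture notes",
]

X = TfidfVectorizer().fit_transform(docs)  # sparse matrix: one column per vocabulary term

print(X.shape)                        # (3, vocab_size) - high-dimensional and sparse
print(cosine_similarity(X[0], X[1]))  # overlap comes only from shared words ("the", "child", "smiled")
print(cosine_similarity(X[0], X[2]))  # ~0.0 - no shared terms, so no similarity at all

Note that the first two sentences only look similar because they share filler words; TF-IDF gives "happy" and "joyful" no credit for meaning the same thing.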

The Embedding Solution

Embeddings learn dense, low-dimensional vectors where semantically similar items cluster together:

Word | Dense Embedding (conceptual)
cat  | [0.2, -0.4, 0.1, ..., 0.8]
dog  | [0.3, -0.3, 0.2, ..., 0.7]
car  | [-0.5, 0.2, -0.1, ..., -0.3]

Now cosine(cat, dog) ≈ 0.8 (similar animals), while cosine(cat, car) ≈ 0.1 (unrelated).
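
As a quick illustration of how that similarity is computed, here is a minimal NumPy sketch with made-up 4-dimensional vectors (real embeddings have hundreds of dimensions):

import numpy as np

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.2, -0.4, 0.1, 0.8])
dog = np.array([0.3, -0.3, 0.2, 0.7])
car = np.array([-0.5, 0.2, -0.1, -0.3])

print(cosine(cat, dog))  # close to 1: the vectors point in a similar direction
print(cosine(cat, car))  # negative here: the vectors point in different directions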

Word Embeddings

Word2Vec: The Pioneer (2013)

Word2Vec, from Mikolov et al. at Google, demonstrated that neural networks could learn meaningful word representations from raw text.

The Core Insight: Distributional Hypothesis

"You shall know a word by the company it keeps" - J.R. Firth (1957)

Words appearing in similar contexts have similar meanings. Word2Vec learns by predicting context:

Skip-gram (predict context from word):

Sentence: "The cat sat on the mat"
Target: "sat"
Predict: ["The", "cat", "on", "the"]

Training creates embeddings where "sat" is similar to other verbs
that appear in similar contexts ("stood", "lay", "slept")
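
If you want to try this yourself, gensim's Word2Vec exposes the skip-gram objective via sg=1. The toy corpus below is only for illustration and is far too small for meaningful vectors:

from gensim.models import Word2Vec

# Tiny toy corpus - real training uses millions of sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "lay", "on", "the", "rug"],
    ["the", "cat", "slept", "on", "the", "rug"],
]

# sg=1 selects skip-gram (predict context words from the target word)
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)              # (50,) dense vector
print(model.wv.similarity("cat", "dog"))  # similarity learned from shared contexts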

Word Analogies

Word2Vec famously captured relationships through vector arithmetic:

king - man + woman ≈ queen
paris - france + japan ≈ tokyo

This works because relationships are encoded as directions in the vector space.
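
With a pretrained model you can reproduce these analogies directly; here is a sketch using gensim's downloader (the Google News vectors are a large download on first use):

import gensim.downloader as api

# Load pretrained Google News word2vec vectors
wv = api.load("word2vec-google-news-300")

# king - man + woman ≈ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Paris - France + Japan ≈ Tokyo (this model is case-sensitive, so capitalised names are used)
print(wv.most_similar(positive=["Paris", "Japan"], negative=["France"], topn=1))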

Limitations

Limitation        | Example
Static embeddings | "bank" has one vector for both "river bank" and "bank account"
No OOV handling   | Unknown words have no representation
Word-level only   | No sentence or document embeddings

GloVe: Global Vectors (2014)

Stanford's GloVe takes a different approach: directly factorize the word co-occurrence matrix.

Key insight: The ratio of co-occurrence probabilities encodes meaning.

Word k | P(k|ice) | P(k|steam) | Ratio
solid  | high     | low        | ≫ 1
gas    | low      | high       | ≪ 1
water  | medium   | medium     | ≈ 1

Words related to "ice" but not "steam" have high ratios. GloVe learns embeddings that capture these ratios.

Comparison to Word2Vec:

  • Similar quality for most tasks
  • GloVe uses global statistics, Word2Vec uses local context windows
  • Both produce static word embeddings

FastText: Subword Embeddings (2016)

Facebook's FastText represents words as bags of character n-grams.

The word "where" with n=3 becomes:

<wh, whe, her, ere, re>, <where>
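
A small sketch of that n-gram extraction (this mirrors the idea from the FastText paper rather than the library's internal code):

def char_ngrams(word, n=3):
    # FastText wraps each word in boundary markers before extracting n-grams
    token = f"<{word}>"
    ngrams = [token[i:i + n] for i in range(len(token) - n + 1)]
    return ngrams + [token]  # the whole word is also kept as a feature

print(char_ngrams("where"))  # ['<wh', 'whe', 'her', 'ere', 're>', '<where>']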

Key benefits:

Benefit      | Why It Matters
OOV handling | Can generate vectors for unseen words
Morphology   | Captures prefixes, suffixes, roots
Rare words   | Better representations through subword sharing

# FastText can handle words it's never seen
fasttext_model["deeplearning"]  # Works! (combines subwords)
word2vec_model["deeplearning"]  # KeyError!
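
A runnable version of that comparison using gensim's implementations (the corpus and model settings are illustrative only):

from gensim.models import FastText, Word2Vec

corpus = [
    ["deep", "learning", "models", "learn", "representations"],
    ["machine", "learning", "is", "widely", "used"],
]

ft = FastText(corpus, vector_size=50, min_count=1)
w2v = Word2Vec(corpus, vector_size=50, min_count=1)

print(ft.wv["deeplearning"].shape)   # works: vector built from character n-grams
print("deeplearning" in w2v.wv)      # False: Word2Vec has no vector for unseen words
# w2v.wv["deeplearning"] would raise KeyError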

Sentence Embeddings

The Context Problem

Word embeddings assign one vector per word regardless of context:

"I deposited money in the bank"  → bank = [0.5, -0.3, ...]
"I sat by the river bank"        → bank = [0.5, -0.3, ...]  # Same!

This is wrong—the meaning of "bank" depends on context.

Sentence Transformers (SBERT, 2019)

Sentence-BERT generates sentence-level embeddings that:

  1. Understand context within the sentence
  2. Can be compared with simple cosine similarity
  3. Are efficient enough for real-time applications

Why Not Use BERT Directly?

BERT wasn't designed for sentence similarity. Using BERT naively requires:

  • Passing both sentences through BERT together
  • O(n²) comparisons for n sentences

For 10,000 sentences, that's roughly 50 million pairwise forward passes!

SBERT generates independent embeddings: O(n) forward passes, then fast vector comparison.

Practical Usage

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = [
    "Machine learning is a subset of AI",
    "AI systems can learn from experience",
    "The weather is nice today",
]

embeddings = model.encode(sentences)

# Similar concepts cluster together
print(cosine_similarity([embeddings[0]], [embeddings[1]]))  # ~0.75
print(cosine_similarity([embeddings[0]], [embeddings[2]]))  # ~0.15

Model Selection

Model                     | Dimensions | Speed  | Quality | Use Case
all-MiniLM-L6-v2          | 384        | Fast   | Good    | General purpose, production
all-mpnet-base-v2         | 768        | Medium | Best    | When quality matters most
all-MiniLM-L12-v2         | 384        | Fast   | Better  | Balance of speed/quality
paraphrase-multilingual-* | 384        | Fast   | Good    | 50+ languages

Rule of thumb: Start with all-MiniLM-L6-v2. Only upgrade if you need better quality and can accept slower inference.

When to Use Each Approach

Decision Framework

Your Need                     | Recommended Approach
Simple word similarity        | Word2Vec or GloVe
Handle unknown words          | FastText
Sentence/paragraph similarity | Sentence Transformers
Semantic search               | Sentence Transformers
Clustering documents          | Sentence Transformers
Word analogies                | Word2Vec
Very limited compute          | Word2Vec (pretrained)

Modern Default

For most NLP tasks in 2024+, start with Sentence Transformers. Only use word embeddings for specific use cases like word analogy tasks or when computational resources are extremely limited.

Practical Patterns

from sentence_transformers import SentenceTransformer, util
import torch

class SemanticSearch:
    """Small in-memory semantic search index built on sentence embeddings."""

    def __init__(self, model_name='all-MiniLM-L6-v2'):
        self.model = SentenceTransformer(model_name)
        self.corpus = []
        self.corpus_embeddings = None

    def index(self, documents):
        """Encode the corpus once; every query reuses these embeddings."""
        self.corpus = documents
        self.corpus_embeddings = self.model.encode(
            documents, convert_to_tensor=True
        )

    def search(self, query, top_k=5):
        """Return the top_k most similar documents with their cosine scores."""
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        scores = util.cos_sim(query_embedding, self.corpus_embeddings)[0]
        top_indices = torch.topk(scores, k=min(top_k, len(self.corpus))).indices

        return [
            {"text": self.corpus[idx], "score": scores[idx].item()}
            for idx in top_indices
        ]

# Usage
search = SemanticSearch()
search.index(["Python programming", "Machine learning basics", "Web development"])
results = search.search("How do I learn AI?")

Batch Processing

Always encode in batches for efficiency:

# Slow: one at a time
embeddings = [model.encode(s) for s in sentences]

# Fast: batch encode
embeddings = model.encode(sentences, batch_size=32)

GPU Acceleration

import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('all-MiniLM-L6-v2', device=device)

GPU acceleration typically provides a 5-10x speedup for batch encoding.

Handling Long Documents

Models have maximum sequence lengths (typically 256-512 tokens):

Strategy               | When to Use
Truncation             | Summary information is at the start
Chunking + averaging   | Information spread throughout
Chunking + max pooling | Key info in specific sections
Long-context models    | Need full document understanding

def embed_long_document(text, model, chunk_size=256, overlap=50):
    """Embed long documents by chunking and averaging."""
    words = text.split()  # whitespace split as a rough proxy for tokens
    chunks = []

    for i in range(0, len(words), chunk_size - overlap):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)

    chunk_embeddings = model.encode(chunks)
    return chunk_embeddings.mean(axis=0)  # Average pooling

Domain Adaptation

Pre-trained models work well for general text but may underperform on specialized domains (medical, legal, financial).

When to Fine-tune

Scenario                         | Action
General text, good results       | Use pretrained
Domain terms, acceptable results | Use pretrained
Domain terms, poor results       | Fine-tune
Critical accuracy needs          | Fine-tune

Fine-tuning Approaches

  1. Similarity pairs: Pairs of texts with similarity scores (0-1)
  2. Triplets: (anchor, positive, negative) examples
  3. Contrastive: Similar/dissimilar pairs

Typically need 1,000-10,000 examples for meaningful improvement.
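
A minimal fine-tuning sketch with similarity-scored pairs, using the classic sentence-transformers fit API (the pairs and scores below are made up; real training needs thousands of labelled examples):

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('all-MiniLM-L6-v2')

# Hypothetical domain pairs with similarity labels in [0, 1]
train_examples = [
    InputExample(texts=["myocardial infarction", "heart attack"], label=0.95),
    InputExample(texts=["myocardial infarction", "broken ankle"], label=0.10),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# One epoch on a toy set just to show the mechanics
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
)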

Best Practices Summary

Do

  • Batch encode for efficiency
  • Normalize embeddings when you rely on dot-product search (cosine similarity normalizes implicitly; see the snippet after this list)
  • Use GPU when available
  • Start with pretrained models before fine-tuning
  • Test on your actual data before choosing a model
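
On the normalization point: you can ask the model to return unit-length vectors at encode time via the normalize_embeddings flag, after which dot product and cosine similarity give the same value. A quick sketch:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')

# Unit-length embeddings: dot product now equals cosine similarity
embeddings = model.encode(
    ["semantic search", "vector databases"],
    normalize_embeddings=True,
    convert_to_tensor=True,
)

print(util.dot_score(embeddings[0], embeddings[1]))  # same value as util.cos_sim here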

Don't

  • Don't encode one sentence at a time in loops
  • Don't assume pretrained models work for specialized domains
  • Don't ignore sequence length limits for long documents
  • Don't mix embeddings from different models in the same vector space

Conclusion

Text embeddings have evolved from sparse representations to powerful dense vectors that capture meaning:

  • Word2Vec/GloVe: Pioneered dense word representations, limited by static nature
  • FastText: Added subword information, handles unknown words
  • Sentence Transformers: Current state-of-the-art for semantic similarity

For modern NLP applications, start with Sentence Transformers (all-MiniLM-L6-v2) and only use word embeddings for specific use cases where they excel.
