Jared AI Hub

Named Entity Recognition: Extracting Structured Information from Text

Authors
  • Jared Chung

Introduction

Named Entity Recognition (NER) is a fundamental NLP task that extracts structured information from unstructured text. It identifies and classifies mentions of real-world entities into categories like people, organizations, locations, and dates.

NER powers many practical applications:

  • Email processing: Extract contacts, meetings, and action items
  • News analysis: Identify companies, people, and events mentioned
  • Resume parsing: Extract skills, education, and work history
  • Data anonymization: Detect and redact PII (personally identifiable information)
  • Knowledge graphs: Build structured databases from documents

This guide explains how NER works, compares different approaches, and shows how to handle domain-specific entities.

How NER Works

[Figure: Named Entity Recognition Pipeline]

The NER Pipeline

A typical NER system processes text in stages:

  1. Tokenization: Split text into individual words or subwords
  2. Encoding: Convert tokens to numerical representations
  3. Classification: Predict entity labels for each token
  4. Span extraction: Combine labeled tokens into entity spans

BIO Tagging Scheme

NER operates at the token level, but entities often span multiple tokens. The BIO scheme handles this:

  • B-XXX: Beginning of entity type XXX
  • I-XXX: Inside (continuation) of entity type XXX
  • O: Outside any entity

Example:

| Token | Label | Meaning |
| --- | --- | --- |
| Apple | B-ORG | Start of organization |
| Inc. | I-ORG | Continuation of organization |
| CEO | O | Not an entity |
| Tim | B-PER | Start of person |
| Cook | I-PER | Continuation of person |
| visited | O | Not an entity |
| Paris | B-LOC | Location (single token) |

The BIO scheme ensures multi-word entities like "Apple Inc." and "Tim Cook" are correctly grouped.
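The span-extraction step can be sketched in plain Python. This is a minimal illustration rather than any library's implementation: the function name `bio_to_spans` is made up for this example, and it assumes a stray I- tag with no matching B- tag should simply be dropped.

```python
def bio_to_spans(tokens, labels):
    """Group BIO-labeled tokens into (entity_text, entity_type) spans."""
    spans = []
    current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; close any entity that is still open.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the open entity, if any.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None

    # Close an entity that runs to the end of the sentence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Apple", "Inc.", "CEO", "Tim", "Cook", "visited", "Paris"]
labels = ["B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O", "B-LOC"]
spans = bio_to_spans(tokens, labels)
print(spans)
# [('Apple Inc.', 'ORG'), ('Tim Cook', 'PER'), ('Paris', 'LOC')]
```

Joining tokens with a space is a simplification; real systems usually keep character offsets so the original text can be recovered exactly.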

Standard Entity Types

Different datasets define different entity categories. The most common:

OntoNotes 5.0 (spaCy default; selected types)

| Type | Description | Examples |
| --- | --- | --- |
| PERSON | People, including fictional | Barack Obama, Sherlock Holmes |
| ORG | Companies, agencies, institutions | Google, FBI, Stanford |
| GPE | Countries, cities, states | France, New York |
| LOC | Non-GPE locations | Mount Everest, Pacific Ocean |
| DATE | Dates and periods | June 2023, yesterday |
| TIME | Times of day | 3:00 PM, morning |
| MONEY | Monetary values | $500, fifty euros |
| PERCENT | Percentages | 25%, three percent |
| PRODUCT | Objects, vehicles | iPhone, Boeing 747 |
| EVENT | Named events | Olympics, World War II |

CoNLL-2003 (Standard benchmark)

| Type | Description |
| --- | --- |
| PER | Person names |
| ORG | Organizations |
| LOC | Locations |
| MISC | Miscellaneous |

The entity types you need depend on your application. Custom domains (medical, legal, financial) typically require custom entity types.

Approaches to NER

There are three main approaches, each with different tradeoffs:

1. Rule-Based (Pattern Matching)

Match explicit patterns like regular expressions or keyword lists.

When to use:

  • Known, finite set of entities (product names, internal codes)
  • High precision required (legal compliance)
  • No training data available

Characteristics:

| Aspect | Rating |
| --- | --- |
| Accuracy on known patterns | Excellent |
| Generalization to unseen text | Poor |
| Speed | Very Fast |
| Maintenance effort | High |

import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")

patterns = [
    {"label": "PRODUCT", "pattern": "iPhone 15 Pro"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "iphone"}, {"IS_DIGIT": True}]},
    {"label": "TECH", "pattern": [{"LOWER": {"IN": ["pytorch", "tensorflow"]}}]},
]
ruler.add_patterns(patterns)

doc = nlp("I use PyTorch on my iPhone 15")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
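The same idea works without spaCy when the patterns are simple. The sketch below uses only the standard library, with one regular expression and one keyword list. The `PRD-` code format and the keyword list are hypothetical examples, and `match_entities` is a name made up for this illustration.

```python
import re

# Hypothetical rules: one regular expression and one keyword list.
CODE_PATTERN = re.compile(r"\bPRD-\d{6}\b")
TECH_KEYWORDS = ["PyTorch", "TensorFlow"]
TECH_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in TECH_KEYWORDS) + r")\b",
    re.IGNORECASE,  # match "pytorch" as well as "PyTorch"
)

RULES = [("INTERNAL_CODE", CODE_PATTERN), ("TECH", TECH_PATTERN)]


def match_entities(text):
    """Return (start, end, matched_text, label) tuples sorted by position."""
    entities = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), match.group(), label))
    return sorted(entities)


sample = "Ticket PRD-123456 tracks the pytorch upgrade"
found = match_entities(sample)
for start, end, matched, label in found:
    print(f"{matched} -> {label} [{start}:{end}]")
```

Every new product name or code format means another rule to write and test, which is where the high maintenance effort comes from.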

2. Statistical Models (spaCy)

Machine learning models trained on labeled data, using features like word shape, context, and embeddings.

When to use:

  • General-purpose NER on standard entity types
  • Production systems requiring speed
  • Balance between accuracy and performance

Characteristics:

| Aspect | Rating |
| --- | --- |
| Accuracy | Good |
| Generalization | Good |
| Speed | Fast |
| Resource usage | Low |

import spacy

nlp = spacy.load("en_core_web_sm")  # or en_core_web_trf for better accuracy

text = """
Elon Musk, CEO of Tesla, announced a $5 billion investment in Berlin.
The press conference was held at Tesla headquarters in Palo Alto.
"""

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.text:20} → {ent.label_}")

Output:

Elon Musk            → PERSON
Tesla                → ORG
$5 billion           → MONEY
Berlin               → GPE
Tesla                → ORG
Palo Alto            → GPE

3. Transformer Models (BERT, RoBERTa)

Deep learning models that use self-attention to interpret each token in the context of the whole sentence.

When to use:

  • Highest accuracy required
  • Ambiguous entities that need context
  • Domain-specific fine-tuning

Characteristics:

| Aspect | Rating |
| --- | --- |
| Accuracy | Excellent |
| Generalization | Very Good |
| Speed | Slower |
| Resource usage | High (GPU preferred) |

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Microsoft CEO Satya Nadella announced a partnership with OpenAI."
entities = ner(text)

for ent in entities:
    print(f"{ent['word']:20} → {ent['entity_group']} ({ent['score']:.3f})")

Choosing an Approach

| Requirement | Best Approach |
| --- | --- |
| Known patterns, high precision | Rule-based |
| General NER, production speed | spaCy statistical |
| Maximum accuracy | Transformer |
| Mixed (known + general) | Rule-based + Statistical |
| Domain-specific entities | Fine-tuned Transformer |

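Hybrid approach: Combine rule-based patterns for known entities with statistical models for general coverage. spaCy's EntityRuler can run before or after the statistical NER: placed before, its matches are kept and the model predicts around them; placed after, it only adds spans that do not overlap the model's entities unless overwrite_ents is enabled.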
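Outside spaCy, the same precedence rule can be applied by hand when the rule-based and model outputs come from separate systems. The sketch below is one possible merge policy, not a standard algorithm: `merge_entities` is a made-up helper, entities are assumed to be (start, end, label) tuples with exclusive end offsets, and the model output shown is invented for illustration.

```python
def merge_entities(rule_ents, model_ents):
    """Combine two entity lists, preferring rule-based spans on overlap."""
    merged = list(rule_ents)
    for start, end, label in model_ents:
        # Two half-open spans overlap when each starts before the other ends.
        overlaps = any(
            start < r_end and r_start < end for r_start, r_end, _ in rule_ents
        )
        if not overlaps:
            merged.append((start, end, label))
    return sorted(merged)


text = "Order PRD-123456 for iPhone 15 Pro shipped to John Smith"
rule_ents = [(6, 16, "INTERNAL_CODE"), (21, 34, "PRODUCT")]
# Hypothetical model output: it mislabels "iPhone" and finds the person.
model_ents = [(21, 27, "ORG"), (46, 56, "PERSON")]

merged = merge_entities(rule_ents, model_ents)
for start, end, label in merged:
    print(f"{text[start:end]} -> {label}")
```

Giving rules priority suits cases where the patterns are exact and trusted; the opposite policy may be better when rules are broad and noisy.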

Training Custom NER Models

Pre-trained models don't know your domain-specific entities. Custom training is needed for:

  • Medical terms (drug names, conditions, procedures)
  • Legal entities (case citations, contract clauses)
  • Financial data (ticker symbols, financial instruments)
  • Product catalogs (your company's products)

Training Data Requirements

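NER training requires labeled data: either token-level BIO tags or character-offset spans, depending on the framework. As a rough guide to how much you need:

| Dataset Size | Typical Results |
| --- | --- |
| 50-100 examples | Basic recognition, many errors |
| 200-500 examples | Reasonable accuracy for common patterns |
| 1000+ examples | Good generalization |
| 5000+ examples | Production-quality for most use cases |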

Quality matters more than quantity. 500 diverse, well-labeled examples beat 2000 noisy ones.

Training with spaCy

spaCy training uses annotated examples with character offsets:

import spacy
from spacy.training import Example
import random

# Training data: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("iPhone 15 Pro has a titanium frame", {"entities": [(0, 13, "PRODUCT")]}),
    ("The M3 chip delivers great performance", {"entities": [(4, 11, "HARDWARE")]}),
    ("macOS Sonoma includes new features", {"entities": [(0, 12, "SOFTWARE")]}),
    ("Download Xcode from the App Store", {"entities": [(9, 14, "SOFTWARE"), (24, 33, "PRODUCT")]}),
    # Add 50-200+ examples per entity type
]

def train_ner(train_data, n_iter=30):
    nlp = spacy.blank("en")
    ner = nlp.add_pipe("ner")

    # Add labels
    for _, annotations in train_data:
        for start, end, label in annotations.get("entities", []):
            ner.add_label(label)

    # Convert to Example objects
    examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in train_data]
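    optimizer = nlp.initialize(lambda: examples)

    # Training loop
    for i in range(n_iter):
        random.shuffle(examples)
        losses = {}
        # Update on small batches rather than one example at a time
        for batch in spacy.util.minibatch(examples, size=8):
            nlp.update(batch, sgd=optimizer, drop=0.35, losses=losses)
        print(f"Iteration {i + 1}: NER loss = {losses.get('ner', 0.0):.3f}")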

    return nlp

nlp = train_ner(TRAIN_DATA)
nlp.to_disk("custom_ner")
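Because label quality matters so much, it can help to check the character offsets before training. The sketch below is a hypothetical validator for the (text, {"entities": [...]}) format used above. The function name and the specific checks are assumptions for illustration; in particular, the mid-word check assumes entities should start and end at alphanumeric word boundaries.

```python
def validate_annotations(train_data):
    """Return a list of human-readable problems found in offset annotations."""
    problems = []
    for text, annotations in train_data:
        previous_end = 0
        for start, end, label in sorted(annotations.get("entities", [])):
            if not (0 <= start < end <= len(text)):
                problems.append(f"{text!r}: ({start}, {end}) is out of range")
                continue
            span = text[start:end]
            starts_mid_word = start > 0 and text[start - 1].isalnum()
            ends_mid_word = end < len(text) and text[end].isalnum()
            if span != span.strip():
                problems.append(f"{text!r}: {span!r} has surrounding whitespace")
            elif starts_mid_word or ends_mid_word:
                problems.append(f"{text!r}: {span!r} starts or ends mid-word")
            elif start < previous_end:
                problems.append(f"{text!r}: {span!r} overlaps the previous entity")
            previous_end = max(previous_end, end)
    return problems


GOOD = [("iPhone 15 Pro has a titanium frame", {"entities": [(0, 13, "PRODUCT")]})]
BAD = [
    # Trailing space included in the span
    ("The M3 chip delivers great performance", {"entities": [(4, 12, "HARDWARE")]}),
    # Span stops in the middle of "Sonoma"
    ("macOS Sonoma includes new features", {"entities": [(0, 10, "SOFTWARE")]}),
    # Second span runs past the end of the text
    ("Download Xcode from the App Store", {"entities": [(9, 14, "SOFTWARE"), (24, 40, "PRODUCT")]}),
]

for problem in validate_annotations(BAD):
    print(problem)
```

Off-by-one offsets are easy to introduce when annotating by hand, so a check like this is worth running whenever the training data changes.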

Fine-tuning Transformers

For higher accuracy, fine-tune a transformer model:

  1. Prepare data in token-level format with BIO labels
  2. Tokenize carefully - align labels with subword tokens
  3. Fine-tune using HuggingFace Trainer

Key considerations:

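  • Subword alignment: When "iPhone" becomes ["i", "##Phone"], only the first subword gets the word's label; the remaining subwords are usually given an ignore index (-100 in HuggingFace) so the loss skips them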
  • Learning rate: Use 2e-5 to 5e-5 for fine-tuning
  • Epochs: 3-10 depending on dataset size
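The subword alignment step is the part most often done wrong, so here is a minimal sketch of it in plain Python. It assumes you already have the list that a HuggingFace fast tokenizer returns from `word_ids()` (one word index per subword, `None` for special tokens); the list below is written by hand to match the "iPhone" split above. `align_labels` and the label ids are made up for this example.

```python
IGNORE_INDEX = -100  # label id that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids, word_labels, label2id):
    """Give each word's label to its first subword and mask everything else."""
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as [CLS] and [SEP]
        elif word_id != previous_word:
            aligned.append(label2id[word_labels[word_id]])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # later subwords of the same word
        previous_word = word_id
    return aligned


label2id = {"O": 0, "B-PRODUCT": 1, "I-PRODUCT": 2}

# Words:    ["iPhone", "15", "rocks"]
# Subwords: [CLS] i ##Phone 15 rocks [SEP]
word_ids = [None, 0, 0, 1, 2, None]
word_labels = ["B-PRODUCT", "I-PRODUCT", "O"]

aligned = align_labels(word_ids, word_labels, label2id)
print(aligned)
# [-100, 1, -100, 2, 0, -100]
```

An alternative is to copy the label to every subword of a word; either convention can work as long as training and evaluation use the same one.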

Evaluation Metrics

NER uses entity-level metrics, not token-level:

| Metric | Definition |
| --- | --- |
| Precision | Of predicted entities, what % are correct? |
| Recall | Of actual entities, what % did we find? |
| F1 Score | Harmonic mean of precision and recall |

Strict matching: Both entity boundaries AND type must be exactly correct.

from seqeval.metrics import classification_report

y_true = [["O", "B-PER", "I-PER", "O", "B-ORG"]]
y_pred = [["O", "B-PER", "I-PER", "O", "O"]]  # Missed ORG

print(classification_report(y_true, y_pred))

The report shows per-entity-type metrics:

  • PER: precision, recall, and F1 of 1.00 (correctly identified)
  • ORG: 0.00 (missed entirely)

Common Evaluation Mistakes

  1. Token-level accuracy misleads: 95% token accuracy can mean 60% entity F1 if boundaries are wrong
  2. O-class dominates: Most tokens aren't entities, so token accuracy looks artificially high
  3. Partial matches: Identifying "Tim" instead of "Tim Cook" counts as a complete miss in strict evaluation
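The gap between token accuracy and entity-level F1 is easy to reproduce. The sketch below scores a single sentence in plain Python under strict matching; `extract_spans` and `strict_scores` are names made up for this illustration, and a library such as seqeval should be preferred for real evaluation across many sentences.

```python
def extract_spans(labels):
    """Return the set of (start, end, type) spans in one BIO label sequence."""
    spans, start, ent_type = set(), None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        inside = label.startswith("I-") and label[2:] == ent_type
        if start is not None and not inside:
            spans.add((start, i, ent_type))
            start, ent_type = None, None
        if label.startswith("B-"):
            start, ent_type = i, label[2:]
    return spans


def strict_scores(y_true, y_pred):
    """Entity-level precision, recall, and F1; boundaries and type must match."""
    true_spans, pred_spans = extract_spans(y_true), extract_spans(y_pred)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Gold: "Tim Cook" is a person. Prediction: only "Tim" is tagged.
y_true = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "O", "O", "O"]
y_pred = ["B-PER", "O", "O", "O", "O", "O", "O", "O", "O", "O"]

token_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = strict_scores(y_true, y_pred)
print(f"token accuracy: {token_accuracy:.2f}")  # 0.90
print(f"entity F1:      {f1:.2f}")              # 0.00
```

Nine of ten tokens are labeled correctly, yet the only entity in the sentence counts as a miss because its boundary is wrong.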

Common Challenges

1. Ambiguous Entities

The same text can be different entity types depending on context:

"Apple released new products"  → Apple = ORG
"I ate an apple for lunch"     → apple = not an entity
"Apple Martin is an actress"   → Apple Martin = PERSON

Solution: Transformer models handle context better than rule-based systems. They consider surrounding words when classifying.

2. Nested Entities

Some entities contain other entities:

"Bank of America headquarters in Charlotte"
- "Bank of America" → ORG
- "America" → GPE (nested inside ORG)
- "Charlotte" → GPE

Standard BIO tagging can't represent nesting. Solutions:

  • Use only the outermost entity
  • Use spaCy's SpanCategorizer for overlapping spans
  • Use specialized nested NER models
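The first option, keeping only the outermost entity, can be sketched as a filter over character spans. This is an illustrative helper, not part of spaCy: `keep_outermost` is a made-up name, spans are assumed to be (start, end, label) tuples with exclusive end offsets, and of two identical spans only the first is kept.

```python
def keep_outermost(spans):
    """Drop any span that is fully contained in another span."""
    # Sort by start, then longest first, so containers come before their contents.
    ordered = sorted(spans, key=lambda s: (s[0], -s[1]))
    kept = []
    for start, end, label in ordered:
        contained = any(
            k_start <= start and end <= k_end for k_start, k_end, _ in kept
        )
        if not contained:
            kept.append((start, end, label))
    return kept


text = "Bank of America headquarters in Charlotte"
spans = [(8, 15, "GPE"), (0, 15, "ORG"), (32, 41, "GPE")]

outer = keep_outermost(spans)
for start, end, label in outer:
    print(f"{text[start:end]} -> {label}")
```

The result is a flat, non-overlapping list that standard BIO tagging can represent, at the cost of losing the inner "America" mention.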

3. Long Documents

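Transformer models have token limits (typically 512 tokens for BERT-style models), so long documents must be split into chunks. The function below chunks by characters for simplicity, which keeps each chunk under that limit for typical English text; the overlap gives entities that straddle a chunk boundary a chance to appear whole in the next chunk:

def deduplicate(entities):
    """Drop entities found twice in an overlap region, keeping the first copy."""
    seen = set()
    unique = []
    for ent in entities:
        key = (ent["start"], ent["end"], ent["label"])
        if key not in seen:
            seen.add(key)
            unique.append(ent)
    return unique

def process_long_doc(text, nlp, chunk_size=500, overlap=50):
    """Process long documents with overlapping chunks (sizes in characters)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    entities = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]

        doc = nlp(chunk)
        for ent in doc.ents:
            entities.append({
                "text": ent.text,
                "label": ent.label_,
                "start": start + ent.start_char,  # Offsets relative to the full text
                "end": start + ent.end_char
            })

        if end == len(text):
            break  # Last chunk done; without this the loop never terminates
        start = end - overlap  # Overlap to catch split entities

    # Removes exact duplicates only; truncated entities at chunk edges may remain
    return deduplicate(entities)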

4. Domain-Specific Entities

Pre-trained models don't recognize specialized terminology:

| Domain | Custom Entities Needed |
| --- | --- |
| Medical | Drug names, conditions, procedures |
| Legal | Case citations, contract terms |
| Finance | Ticker symbols, financial products |
| E-commerce | Product SKUs, brand names |

Solutions:

  1. Add rule-based patterns for known terms
  2. Fine-tune on domain-specific data
  3. Use domain-specific pre-trained models (BioBERT, LegalBERT)

Production Considerations

Batch Processing

Process multiple texts efficiently:

import spacy

nlp = spacy.load("en_core_web_sm")
texts = ["Text 1...", "Text 2...", "Text 3..."]

# spaCy's nlp.pipe batches texts and is much faster than individual nlp() calls
results = []
for doc in nlp.pipe(texts, batch_size=50):
    results.append([(ent.text, ent.label_) for ent in doc.ents])

Model Selection

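| Model | Speed | Accuracy | Memory |
| --- | --- | --- | --- |
| en_core_web_sm | Fast | Good | 12MB |
| en_core_web_md | Medium | Better | 40MB |
| en_core_web_lg | Medium | Better | 560MB |
| en_core_web_trf | Slow | Best | 440MB |
| Custom BERT | Slow | Domain-best | ~500MB |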

For production:

  • Start with en_core_web_sm for speed
  • Upgrade to trf if accuracy is insufficient
  • Fine-tune only if pre-trained models don't cover your entities

Combining Approaches

A practical production system often combines multiple approaches:

  1. EntityRuler first: Catch known patterns with high confidence
  2. Statistical NER: Handle general entities
  3. Post-processing: Apply business rules (validation, deduplication)

import spacy

nlp = spacy.load("en_core_web_sm")

# Add rule-based patterns BEFORE statistical NER
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PRODUCT", "pattern": "iPhone 15 Pro"},
    # REGEX is searched within each token, so anchor it to match the whole token
    {"label": "INTERNAL_CODE", "pattern": [{"TEXT": {"REGEX": "^PRD-\\d{6}$"}}]},
])

# Now both custom patterns and statistical NER run
doc = nlp("Order PRD-123456 for iPhone 15 Pro shipped to John Smith")
for ent in doc.ents:
    print(f"{ent.text} → {ent.label_}")
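The post-processing step is application-specific, but a small sketch shows the shape it often takes. Everything here is an assumption for illustration: the `postprocess` helper, the rule that internal codes must look like PRD- followed by six digits, and the choice to deduplicate case-insensitively.

```python
import re

# Assumed business rule: internal codes are "PRD-" plus exactly six digits.
CODE_RE = re.compile(r"PRD-\d{6}")


def postprocess(entities):
    """Validate and deduplicate a list of (text, label) pairs."""
    seen, cleaned = set(), []
    for text, label in entities:
        text = text.strip()
        if label == "INTERNAL_CODE" and not CODE_RE.fullmatch(text):
            continue  # Fails validation, so drop it
        key = (text.lower(), label)
        if key in seen:
            continue  # Already recorded this entity
        seen.add(key)
        cleaned.append((text, label))
    return cleaned


raw = [
    ("PRD-123456", "INTERNAL_CODE"),
    ("PRD-12", "INTERNAL_CODE"),   # Malformed code
    ("John Smith", "PERSON"),
    ("john smith ", "PERSON"),     # Duplicate with different casing and spacing
]
cleaned = postprocess(raw)
print(cleaned)
# [('PRD-123456', 'INTERNAL_CODE'), ('John Smith', 'PERSON')]
```

With a spaCy pipeline, the input pairs could be built from `doc.ents` as `[(ent.text, ent.label_) for ent in doc.ents]`.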

Conclusion

NER extracts structured entities from unstructured text. Key takeaways:

Understanding NER:

  • BIO tagging handles multi-token entities
  • Entity types depend on your use case
  • Evaluation must be entity-level, not token-level

Choosing an approach:

  • Rule-based for known patterns
  • spaCy statistical for general production use
  • Transformers for maximum accuracy
  • Combine approaches for best results

Custom training:

  • Needed for domain-specific entities
  • Quality of labels matters more than quantity
  • 200-500 diverse examples is a good starting point

Production tips:

  • Use batch processing (nlp.pipe)
  • Start simple, add complexity as needed
  • Monitor and iterate on real data
