Named Entity Recognition (NER): From Rule-Based to Transformer Models

Jared Chung
Introduction

Named Entity Recognition (NER) is a fundamental NLP task that identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and more. NER is the foundation of information extraction pipelines and powers features like:

  • Extracting contacts from emails
  • Identifying companies mentioned in news articles
  • Parsing resumes for candidate information
  • Anonymizing sensitive data (PII detection)
  • Building knowledge graphs

In this comprehensive guide, we'll explore NER from basic concepts to production implementations.

Prerequisites

# Core packages
pip install spacy transformers torch datasets seqeval

# Download spaCy models
python -m spacy download en_core_web_sm
python -m spacy download en_core_web_trf  # Transformer-based (better but slower)

Understanding NER

What Does NER Do?

NER takes unstructured text and extracts structured information:

Input:  "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."

Output:
- Apple Inc.   → ORGANIZATION (ORG)
- Steve Jobs   → PERSON (PER)
- Cupertino    → GEOPOLITICAL ENTITY (GPE)
- California   → GEOPOLITICAL ENTITY (GPE)
- 1976         → DATE

Standard Entity Types

Different datasets define different entity types. Here are the most common:

OntoNotes 5.0 (spaCy default)

Type         | Description                              | Examples
PERSON       | People, including fictional              | Barack Obama, Sherlock Holmes
ORG          | Companies, agencies, institutions        | Google, FBI, Stanford University
GPE          | Countries, cities, states                | France, New York, California
LOC          | Non-GPE locations                        | Mount Everest, Pacific Ocean
DATE         | Absolute or relative dates               | June 2023, yesterday, next week
TIME         | Times smaller than a day                 | 3:00 PM, morning
MONEY        | Monetary values                          | $500, fifty euros
PERCENT      | Percentages                              | 25%, three percent
CARDINAL     | Numerals not covered by other types      | one, 1000, dozens
ORDINAL      | First, second, etc.                      | first, 3rd
PRODUCT      | Objects, vehicles, foods (not services)  | iPhone, Boeing 747
EVENT        | Named events                             | World War II, Olympics
WORK_OF_ART  | Titles of books, songs, etc.             | Harry Potter, Let It Be
LAW          | Named documents made into laws           | Roe v. Wade, GDPR
LANGUAGE     | Named languages                          | English, Spanish

CoNLL-2003 (Common benchmark)

Type | Description
PER  | Person names
ORG  | Organizations
LOC  | Locations
MISC | Miscellaneous entities

BIO Tagging Scheme

Internally, NER uses token-level labeling. The BIO scheme marks:

  • B-XXX: Beginning of entity type XXX
  • I-XXX: Inside (continuation) of entity type XXX
  • O: Outside any entity

Token:  Apple  Inc.   was  founded  by  Steve  Jobs   in  Cupertino
Label:  B-ORG  I-ORG  O    O        O   B-PER  I-PER  O   B-GPE

This handles multi-word entities correctly.
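
You rarely need to write BIO tags by hand: spaCy exposes the same scheme on each token, which is handy for exporting training data. A minimal sketch (assuming the small English model from the Prerequisites section):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino")

# token.ent_iob_ is "B", "I", or "O"; token.ent_type_ holds the entity label
for token in doc:
    tag = f"{token.ent_iob_}-{token.ent_type_}" if token.ent_iob_ != "O" else "O"
    print(f"{token.text:10} {tag}")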

spaCy: Production-Ready NER

spaCy is the go-to library for production NLP. It's fast, accurate, and easy to use.

Basic Usage

import spacy
from spacy import displacy

# Load the small English model (faster, less accurate)
nlp = spacy.load("en_core_web_sm")

# Or load the transformer model (slower, more accurate)
# nlp = spacy.load("en_core_web_trf")

text = """
Elon Musk, CEO of Tesla and SpaceX, announced on Monday that the company
will open a new Gigafactory in Berlin, Germany. The $5 billion investment
was confirmed during a press conference at the Tesla headquarters in
Palo Alto, California.
"""

# Process the text
doc = nlp(text)

# Extract entities
print("=== Named Entities ===")
for ent in doc.ents:
    print(f"{ent.text:20} | {ent.label_:12} | {ent.start_char:4}-{ent.end_char:4}")

Output:

Elon Musk            | PERSON       |    1-10
Tesla                | ORG          |   19-24
SpaceX               | ORG          |   29-35
Monday               | DATE         |   50-56
Gigafactory          | ORG          |   97-108
Berlin               | GPE          |  112-118
Germany              | GPE          |  120-127
$5 billion           | MONEY        |  133-143
Tesla                | ORG          |  202-207
Palo Alto            | GPE          |  224-233
California           | GPE          |  235-245

Visualizing Entities

from spacy import displacy

# In Jupyter notebook
displacy.render(doc, style="ent", jupyter=True)

# Generate HTML file
html = displacy.render(doc, style="ent", page=True)
with open("entities.html", "w", encoding="utf-8") as f:
    f.write(html)
print("Saved visualization to entities.html")

# Customize colors
colors = {"ORG": "#ff6b6b", "PERSON": "#4ecdc4", "GPE": "#45b7d1"}
options = {"colors": colors}
html = displacy.render(doc, style="ent", page=True, options=options)

Accessing Entity Details

for ent in doc.ents:
    print(f"Text: {ent.text}")
    print(f"  Label: {ent.label_} ({spacy.explain(ent.label_)})")
    print(f"  Start token: {ent.start}, End token: {ent.end}")
    print(f"  Start char: {ent.start_char}, End char: {ent.end_char}")
    print(f"  Root: {ent.root.text}")
    print()

Custom Rules with EntityRuler

Add pattern-based entity recognition for domain-specific terms:

import spacy
from spacy.pipeline import EntityRuler

nlp = spacy.load("en_core_web_sm")

# Add entity ruler BEFORE the NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Define patterns
patterns = [
    # Exact match
    {"label": "PRODUCT", "pattern": "iPhone 15 Pro"},
    {"label": "PRODUCT", "pattern": "MacBook Pro"},

    # Token-based patterns (more flexible)
    {"label": "PRODUCT", "pattern": [{"LOWER": "iphone"}, {"IS_DIGIT": True}]},
    {"label": "TECH", "pattern": [{"LOWER": {"IN": ["pytorch", "tensorflow", "keras", "scikit-learn"]}}]},

    # Regex-like patterns
    {"label": "VERSION", "pattern": [{"TEXT": {"REGEX": "v\\d+\\.\\d+(\\.\\d+)?"}}]},

    # Complex patterns
    {"label": "PROGRAMMING_LANG", "pattern": [
        {"LOWER": {"IN": ["python", "javascript", "java", "rust", "go", "c++"]}},
    ]},
]

ruler.add_patterns(patterns)

# Test
test_texts = [
    "I use PyTorch for deep learning on my MacBook Pro",
    "The latest iPhone 15 Pro has a titanium frame",
    "We upgraded to Python 3.11 and TensorFlow 2.15",
    "The API is now at v2.3.1",
]

for text in test_texts:
    doc = nlp(text)
    print(f"\n'{text}'")
    for ent in doc.ents:
        print(f"  {ent.text}{ent.label_}")

Combining Rules and ML

The EntityRuler can work alongside the statistical NER:

# By default the EntityRuler does NOT overwrite entities that are already set
nlp = spacy.load("en_core_web_sm")

# Add AFTER ner with overwrite_ents=True so patterns override the statistical model
ruler = nlp.add_pipe("entity_ruler", after="ner", config={"overwrite_ents": True})

# Or add BEFORE ner: the statistical NER respects entities already set by the ruler
# ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [
    {"label": "COMPANY", "pattern": "Apple"},  # Override default ORG
]
ruler.add_patterns(patterns)

doc = nlp("Apple announced new products")
for ent in doc.ents:
    print(f"{ent.text}{ent.label_}")

Transformer-Based NER with Hugging Face

For state-of-the-art accuracy, use transformer models.

Using Pre-trained NER Models

from transformers import pipeline

# Load NER pipeline with a fine-tuned model
ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple"  # Merge B-/I- tokens
)

text = """
Microsoft CEO Satya Nadella announced a $10 billion investment in OpenAI.
The partnership was revealed at their Redmond, Washington headquarters.
"""

# Run NER
entities = ner_pipeline(text)

print("=== Transformer NER Results ===")
for entity in entities:
    print(f"{entity['word']:20} | {entity['entity_group']:10} | {entity['score']:.4f}")

Output:

Microsoft            | ORG        | 0.9985
Satya Nadella        | PER        | 0.9991
OpenAI               | ORG        | 0.9972
Redmond              | LOC        | 0.9988
Washington           | LOC        | 0.9976

Comparing Different Models

from transformers import pipeline

models = [
    "dslim/bert-base-NER",
    "Jean-Baptiste/roberta-large-ner-english",
    "xlm-roberta-large-finetuned-conll03-english",
]

text = "Google CEO Sundar Pichai visited Paris last Monday."

for model_name in models:
    print(f"\n=== {model_name} ===")
    ner = pipeline("ner", model=model_name, aggregation_strategy="simple")
    results = ner(text)
    for ent in results:
        print(f"  {ent['word']:15}{ent['entity_group']:6} ({ent['score']:.3f})")

Low-Level Transformer Usage

For more control, use the model directly:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "dslim/bert-base-NER"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Get label mapping
id2label = model.config.id2label

text = "Apple CEO Tim Cook announced new products in San Francisco."

# Tokenize
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")[0]

# Predict
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)[0]

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

print("Token-level predictions:")
for token, pred_id, offset in zip(tokens, predictions, offset_mapping):
    if token not in ["[CLS]", "[SEP]", "[PAD]"]:
        label = id2label[pred_id.item()]
        if label != "O":
            print(f"  {token:15}{label}")

Training Custom NER Models

When to Train Custom Models

  • Your entities aren't covered by pre-trained models (e.g., drug names, legal terms)
  • You need higher accuracy for specific entity types
  • You're working in a specialized domain

Training with spaCy

import spacy
from spacy.tokens import DocBin
from spacy.training import Example
import random

# Training data format: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
    ("iPhone 15 Pro Max has a titanium frame",
     {"entities": [(0, 17, "PRODUCT")]}),
    ("The M3 chip delivers amazing performance",
     {"entities": [(4, 11, "HARDWARE")]}),
    ("macOS Sonoma includes new features",
     {"entities": [(0, 12, "SOFTWARE")]}),
    ("Apple Vision Pro launches next month",
     {"entities": [(0, 16, "PRODUCT")]}),
    ("Download Xcode from the App Store",
     {"entities": [(9, 14, "SOFTWARE"), (24, 33, "PRODUCT")]}),
    ("The A17 Pro chip powers the new iPhone",
     {"entities": [(4, 15, "HARDWARE"), (31, 37, "PRODUCT")]}),
    ("Install Python 3.11 for development",
     {"entities": [(8, 19, "SOFTWARE")]}),
    ("VS Code is a popular editor",
     {"entities": [(0, 7, "SOFTWARE")]}),
    # Add more training examples (typically need 50-200+ per entity type)
]

def train_spacy_ner(train_data, output_dir="custom_ner", n_iter=30):
    """Train a custom spaCy NER model."""

    # Create blank English model
    nlp = spacy.blank("en")

    # Add NER component
    ner = nlp.add_pipe("ner")

    # Add labels
    for _, annotations in train_data:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    print(f"Labels: {ner.labels}")

    # Convert to Example objects
    examples = []
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        examples.append(example)

    # Initialize the model
    nlp.initialize(lambda: examples)

    # Training loop
    for itn in range(n_iter):
        random.shuffle(examples)
        losses = {}

        for example in examples:
            nlp.update([example], drop=0.35, losses=losses)

        if itn % 10 == 0:
            print(f"Iteration {itn}: Loss = {losses['ner']:.4f}")

    # Save model
    nlp.to_disk(output_dir)
    print(f"\nModel saved to {output_dir}")

    return nlp

# Train
nlp = train_spacy_ner(TRAIN_DATA)

# Test
test_texts = [
    "The iPhone 15 Pro uses the A17 Pro chip",
    "Update your macOS Sonoma to the latest version",
    "Download VS Code and Python 3.12",
]

print("\n=== Testing Custom Model ===")
for text in test_texts:
    doc = nlp(text)
    print(f"\n'{text}'")
    for ent in doc.ents:
        print(f"  {ent.text}{ent.label_}")

Training with Transformers (Fine-tuning BERT)

For higher accuracy, fine-tune a transformer model:

import torch
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    DataCollatorForTokenClassification,
)
from datasets import Dataset
import numpy as np
from seqeval.metrics import classification_report, f1_score

# Define labels (must include O for non-entities)
label_list = ["O", "B-PRODUCT", "I-PRODUCT", "B-HARDWARE", "I-HARDWARE", "B-SOFTWARE", "I-SOFTWARE"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}

# Prepare training data
# Format: list of {"tokens": [...], "ner_tags": [...]}
train_data = [
    {
        "tokens": ["iPhone", "15", "Pro", "has", "a", "titanium", "frame"],
        "ner_tags": [label2id["B-PRODUCT"], label2id["I-PRODUCT"], label2id["I-PRODUCT"],
                    label2id["O"], label2id["O"], label2id["O"], label2id["O"]]
    },
    {
        "tokens": ["The", "M3", "chip", "is", "fast"],
        "ner_tags": [label2id["O"], label2id["B-HARDWARE"], label2id["I-HARDWARE"],
                    label2id["O"], label2id["O"]]
    },
    {
        "tokens": ["Install", "Python", "3.11", "now"],
        "ner_tags": [label2id["O"], label2id["B-SOFTWARE"], label2id["I-SOFTWARE"], label2id["O"]]
    },
    # Add more examples (typically need 1000+)
]

# Create dataset
dataset = Dataset.from_list(train_data)

# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_and_align_labels(examples):
    """Tokenize and align labels with subword tokens."""
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
        max_length=128,
    )

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None

        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens get -100 (ignored in loss)
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First token of a word gets the label
                label_ids.append(label[word_idx])
            else:
                # Other tokens of the same word get -100
                label_ids.append(-100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized["labels"] = labels
    return tokenized

# Tokenize dataset
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=dataset.column_names
)

# Load model
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(label_list),
    id2label=id2label,
    label2id=label2id,
)

# Training arguments
training_args = TrainingArguments(
    output_dir="./ner_model",
    num_train_epochs=10,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    weight_decay=0.01,
    logging_steps=10,
    save_strategy="epoch",
    report_to="none",
)

# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Train (uncomment to run)
# trainer.train()

# Save model
# trainer.save_model("./ner_model_final")

print("Training configuration ready!")
print(f"Labels: {label_list}")
print(f"Number of training examples: {len(train_data)}")

Evaluation Metrics

Understanding NER Metrics

NER evaluation uses entity-level metrics:

  • Precision: Of predicted entities, what % are correct?
  • Recall: Of actual entities, what % did we find?
  • F1 Score: Harmonic mean of precision and recall

Strict vs Partial Matching

  • Strict: Entity boundaries AND type must match exactly
  • Partial: Predictions that overlap the gold span with the correct type count as matches

from seqeval.metrics import classification_report, f1_score, precision_score, recall_score

# Ground truth and predictions (BIO format)
y_true = [
    ["O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG"],
    ["B-LOC", "O", "O", "B-PER", "O"],
]

y_pred = [
    ["O", "B-PER", "I-PER", "O", "B-ORG", "O"],  # Missed I-ORG
    ["B-LOC", "O", "O", "O", "O"],  # Missed B-PER
]

print("=== Classification Report ===")
print(classification_report(y_true, y_pred))

print(f"\nOverall F1: {f1_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")

Evaluating spaCy Models

import spacy
from spacy.training import Example
from spacy.scorer import Scorer

def evaluate_ner(nlp, test_data):
    """Evaluate NER model on test data."""
    examples = []

    for text, annotations in test_data:
        # Run the full pipeline so the predicted doc actually contains the model's entities
        doc = nlp(text)
        example = Example.from_dict(doc, annotations)
        examples.append(example)

    scorer = Scorer()
    scores = scorer.score(examples)

    return {
        "precision": scores["ents_p"],
        "recall": scores["ents_r"],
        "f1": scores["ents_f"],
        "per_type": scores["ents_per_type"]
    }

# Test data
TEST_DATA = [
    ("iPhone 15 Pro Max is available now", {"entities": [(0, 17, "PRODUCT")]}),
    ("The M3 chip is revolutionary", {"entities": [(4, 11, "HARDWARE")]}),
]

# Evaluate
nlp = spacy.load("custom_ner")
results = evaluate_ner(nlp, TEST_DATA)
print(f"F1 Score: {results['f1']:.4f}")

Common Challenges and Solutions

1. Nested Entities

Some entities contain other entities:

"Bank of America headquarters"
- "Bank of America" → ORG
- "America" → GPE (nested inside ORG)

Solution: Use spaCy's SpanCategorizer or specialized nested NER models.
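
spaCy's doc.ents cannot hold overlapping spans, but doc.spans (the layer the SpanCategorizer predicts into) can. A small illustration of why that matters for nesting, using a blank English pipeline:

import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Bank of America headquarters")

# doc.spans allows overlapping/nested spans, unlike doc.ents
doc.spans["sc"] = [
    Span(doc, 0, 3, label="ORG"),  # "Bank of America"
    Span(doc, 2, 3, label="GPE"),  # "America", nested inside the ORG span
]

for span in doc.spans["sc"]:
    print(span.text, "→", span.label_)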

2. Ambiguous Entities

Context determines the entity type:

"Apple released new products"  → Apple = ORG
"I ate an apple"               → apple = not an entity

Solution: Transformer models handle context better than rule-based systems.
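
A quick way to see this is to run a contextual model on both sentences. With the fine-tuned BERT pipeline used earlier, "Apple" is typically tagged as an organization only in the first sentence:

from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

for sentence in ["Apple released new products", "I ate an apple for lunch"]:
    entities = [(e["word"], e["entity_group"]) for e in ner(sentence)]
    print(f"{sentence!r} → {entities}")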

3. Long Documents

Most models have token limits (512 for BERT):

def process_long_document(text, nlp, max_length=1_000_000):
    """Process a long document directly with spaCy (no 512-token limit)."""
    # spaCy handles long texts natively; raise nlp.max_length if the text exceeds it
    nlp.max_length = max(nlp.max_length, max_length, len(text))
    doc = nlp(text)
    return list(doc.ents)

def process_long_document_chunks(text, nlp, chunk_size=500, overlap=50):
    """Process with overlapping chunks for transformers."""
    all_entities = []
    start = 0

    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]

        doc = nlp(chunk)
        for ent in doc.ents:
            all_entities.append({
                "text": ent.text,
                "label": ent.label_,
                "start": start + ent.start_char,
                "end": start + ent.end_char
            })

        if end == len(text):
            break  # reached the end; otherwise the final chunk would be re-processed forever
        start = end - overlap

    # Remove duplicates from overlap
    # (implement deduplication logic based on your needs)
    return all_entities

4. Domain-Specific Entities

Pre-trained models don't know your domain:

# Medical entities require specialized models
from transformers import pipeline

# Use domain-specific model
bio_ner = pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")

text = "The patient was prescribed 500mg Metformin for Type 2 Diabetes"
entities = bio_ner(text)
for ent in entities:
    print(f"{ent['word']}{ent['entity_group']}")

Production Deployment

Complete NER Service

import spacy
from typing import List, Dict
from dataclasses import dataclass
from functools import lru_cache

@dataclass
class Entity:
    text: str
    label: str
    start: int
    end: int
    confidence: float = 1.0

class NERService:
    """Production-ready NER service."""

    def __init__(self, model_name: str = "en_core_web_trf"):
        self.nlp = spacy.load(model_name)
        print(f"Loaded model: {model_name}")
        print(f"Pipeline: {self.nlp.pipe_names}")

    def extract_entities(self, text: str) -> List[Entity]:
        """Extract entities from text."""
        doc = self.nlp(text)
        return [
            Entity(
                text=ent.text,
                label=ent.label_,
                start=ent.start_char,
                end=ent.end_char,
            )
            for ent in doc.ents
        ]

    def extract_batch(self, texts: List[str], batch_size: int = 50) -> List[List[Entity]]:
        """Process multiple texts efficiently."""
        results = []
        for doc in self.nlp.pipe(texts, batch_size=batch_size):
            entities = [
                Entity(text=ent.text, label=ent.label_, start=ent.start_char, end=ent.end_char)
                for ent in doc.ents
            ]
            results.append(entities)
        return results

    def get_entities_by_type(self, text: str, entity_types: List[str]) -> List[Entity]:
        """Extract only specific entity types."""
        all_entities = self.extract_entities(text)
        return [ent for ent in all_entities if ent.label in entity_types]

# Usage
ner = NERService("en_core_web_sm")

# Single text
entities = ner.extract_entities("Apple CEO Tim Cook visited Paris")
for ent in entities:
    print(f"{ent.text}{ent.label}")

# Batch processing
texts = ["Microsoft is in Seattle", "Google is in Mountain View"]
batch_results = ner.extract_batch(texts)
for i, entities in enumerate(batch_results):
    print(f"\nText {i+1}:")
    for ent in entities:
        print(f"  {ent.text}{ent.label}")

# Filter by type
people = ner.get_entities_by_type("Tim Cook met Satya Nadella", ["PERSON"])
print(f"\nPeople: {[p.text for p in people]}")

Conclusion

NER is a well-solved problem for common entity types, but domain-specific applications require custom training:

  • spaCy: Best for production - fast, easy to use, good accuracy
  • Hugging Face Transformers: Best accuracy, especially for difficult cases
  • Custom Training: Essential for domain-specific entities
  • Evaluation: Always use proper entity-level metrics (seqeval)

Key takeaways:

  1. Start with pre-trained models (spaCy or HuggingFace)
  2. Add custom rules for known patterns (EntityRuler)
  3. Fine-tune when you need domain-specific entities
  4. Evaluate properly with entity-level F1 scores
