Named Entity Recognition (NER): From Rule-Based to Transformer Models

Author: Jared Chung
Introduction
Named Entity Recognition (NER) is a fundamental NLP task that identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and more. NER is the foundation of information extraction pipelines and powers features like:
- Extracting contacts from emails
- Identifying companies mentioned in news articles
- Parsing resumes for candidate information
- Anonymizing sensitive data (PII detection)
- Building knowledge graphs
In this comprehensive guide, we'll explore NER from basic concepts to production implementations.
Prerequisites
# Core packages
pip install spacy transformers torch datasets seqeval
# Download spaCy models
python -m spacy download en_core_web_sm
# Transformer-based pipeline (more accurate but slower; requires the
# spacy-transformers extra first: pip install spacy[transformers])
python -m spacy download en_core_web_trf
Understanding NER
What Does NER Do?
NER takes unstructured text and extracts structured information:
Input: "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."
Output:
- Apple Inc. → ORGANIZATION (ORG)
- Steve Jobs → PERSON (PER)
- Cupertino → GEOPOLITICAL ENTITY (GPE)
- California → GEOPOLITICAL ENTITY (GPE)
- 1976 → DATE
Standard Entity Types
Different datasets define different entity types. Here are the most common:
OntoNotes 5.0 (spaCy default)
| Type | Description | Examples |
|---|---|---|
| PERSON | People, including fictional | Barack Obama, Sherlock Holmes |
| ORG | Companies, agencies, institutions | Google, FBI, Stanford University |
| GPE | Countries, cities, states | France, New York, California |
| LOC | Non-GPE locations | Mount Everest, Pacific Ocean |
| DATE | Absolute or relative dates | June 2023, yesterday, next week |
| TIME | Times smaller than a day | 3:00 PM, morning |
| MONEY | Monetary values | $500, fifty euros |
| PERCENT | Percentages | 25%, three percent |
| CARDINAL | Numerals not covered by other types | one, 1000, dozens |
| ORDINAL | First, second, etc. | first, 3rd |
| PRODUCT | Objects, vehicles, foods (not services) | iPhone, Boeing 747 |
| EVENT | Named events | World War II, Olympics |
| WORK_OF_ART | Titles of books, songs, etc. | Harry Potter, Let It Be |
| LAW | Named documents made into laws | Roe v. Wade, GDPR |
| LANGUAGE | Named languages | English, Spanish |
CoNLL-2003 (Common benchmark)
| Type | Description |
|---|---|
| PER | Person names |
| ORG | Organizations |
| LOC | Locations |
| MISC | Miscellaneous entities |
BIO Tagging Scheme
Internally, NER uses token-level labeling. The BIO scheme marks:
- B-XXX: Beginning of entity type XXX
- I-XXX: Inside (continuation) of entity type XXX
- O: Outside any entity
Token: Apple Inc. was founded by Steve Jobs in Cupertino
Label: B-ORG I-ORG O O O B-PER I-PER O B-GPE
This handles multi-word entities correctly.
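To see how character-level annotations map onto these tags, here is a minimal sketch that converts (start, end, label) spans into BIO labels. It uses plain whitespace tokenization purely for illustration; real pipelines use their own tokenizers.
def spans_to_bio(text, spans):
    """Convert character-offset entity spans into per-token BIO tags (whitespace tokens only)."""
    tokens, tags, pos = [], [], 0
    for token in text.split():
        start = text.index(token, pos)   # character offset of this token
        end = start + len(token)
        pos = end
        tag = "O"
        for ent_start, ent_end, label in spans:
            if start == ent_start:
                tag = f"B-{label}"       # token starts the entity
            elif ent_start < start < ent_end:
                tag = f"I-{label}"       # token continues the entity
        tokens.append(token)
        tags.append(tag)
    return tokens, tags

tokens, tags = spans_to_bio(
    "Steve Jobs founded Apple in Cupertino",
    [(0, 10, "PER"), (19, 24, "ORG"), (28, 37, "GPE")],
)
print(list(zip(tokens, tags)))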
spaCy: Production-Ready NER
spaCy is the go-to library for production NLP. It's fast, accurate, and easy to use.
Basic Usage
import spacy
from spacy import displacy
# Load the small English model (faster, less accurate)
nlp = spacy.load("en_core_web_sm")
# Or load the transformer model (slower, more accurate)
# nlp = spacy.load("en_core_web_trf")
text = """
Elon Musk, CEO of Tesla and SpaceX, announced on Monday that the company
will open a new Gigafactory in Berlin, Germany. The $5 billion investment
was confirmed during a press conference at the Tesla headquarters in
Palo Alto, California.
"""
# Process the text
doc = nlp(text)
# Extract entities
print("=== Named Entities ===")
for ent in doc.ents:
print(f"{ent.text:20} | {ent.label_:12} | {ent.start_char:4}-{ent.end_char:4}")
Output:
Elon Musk | PERSON | 1-10
Tesla | ORG | 19-24
SpaceX | ORG | 29-35
Monday | DATE | 50-56
Gigafactory | ORG | 97-108
Berlin | GPE | 112-118
Germany | GPE | 120-127
$5 billion | MONEY | 133-143
Tesla | ORG | 202-207
Palo Alto | GPE | 224-233
California | GPE | 235-245
Visualizing Entities
from spacy import displacy
# In Jupyter notebook
displacy.render(doc, style="ent", jupyter=True)
# Generate HTML file
html = displacy.render(doc, style="ent", page=True)
with open("entities.html", "w", encoding="utf-8") as f:
f.write(html)
print("Saved visualization to entities.html")
# Customize colors
colors = {"ORG": "#ff6b6b", "PERSON": "#4ecdc4", "GPE": "#45b7d1"}
options = {"colors": colors}
html = displacy.render(doc, style="ent", page=True, options=options)
Accessing Entity Details
for ent in doc.ents:
print(f"Text: {ent.text}")
print(f" Label: {ent.label_} ({spacy.explain(ent.label_)})")
print(f" Start token: {ent.start}, End token: {ent.end}")
print(f" Start char: {ent.start_char}, End char: {ent.end_char}")
print(f" Root: {ent.root.text}")
print()
Custom Rules with EntityRuler
Add pattern-based entity recognition for domain-specific terms:
import spacy
from spacy.pipeline import EntityRuler
nlp = spacy.load("en_core_web_sm")
# Add entity ruler BEFORE the NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")
# Define patterns
patterns = [
# Exact match
{"label": "PRODUCT", "pattern": "iPhone 15 Pro"},
{"label": "PRODUCT", "pattern": "MacBook Pro"},
# Token-based patterns (more flexible)
{"label": "PRODUCT", "pattern": [{"LOWER": "iphone"}, {"IS_DIGIT": True}]},
{"label": "TECH", "pattern": [{"LOWER": {"IN": ["pytorch", "tensorflow", "keras", "scikit-learn"]}}]},
# Regex-like patterns
{"label": "VERSION", "pattern": [{"TEXT": {"REGEX": "v\\d+\\.\\d+(\\.\\d+)?"}}]},
# Complex patterns
{"label": "PROGRAMMING_LANG", "pattern": [
{"LOWER": {"IN": ["python", "javascript", "java", "rust", "go", "c++"]}},
]},
]
ruler.add_patterns(patterns)
# Test
test_texts = [
"I use PyTorch for deep learning on my MacBook Pro",
"The latest iPhone 15 Pro has a titanium frame",
"We upgraded to Python 3.11 and TensorFlow 2.15",
"The API is now at v2.3.1",
]
for text in test_texts:
doc = nlp(text)
print(f"\n'{text}'")
for ent in doc.ents:
print(f" {ent.text} → {ent.label_}")
Combining Rules and ML
The EntityRuler can work alongside the statistical NER:
# By default, a later component will not overwrite entity spans set earlier in the pipeline
nlp = spacy.load("en_core_web_sm")
# Add BEFORE ner so pattern matches take precedence (the statistical NER respects existing spans)
ruler = nlp.add_pipe("entity_ruler", before="ner")
# Or add AFTER ner and explicitly allow it to overwrite the statistical predictions:
# ruler = nlp.add_pipe("entity_ruler", after="ner", config={"overwrite_ents": True})
patterns = [
{"label": "COMPANY", "pattern": "Apple"}, # Override default ORG
]
ruler.add_patterns(patterns)
doc = nlp("Apple announced new products")
for ent in doc.ents:
print(f"{ent.text} → {ent.label_}")
Transformer-Based NER with Hugging Face
For state-of-the-art accuracy, use transformer models.
Using Pre-trained NER Models
from transformers import pipeline
# Load NER pipeline with a fine-tuned model
ner_pipeline = pipeline(
"ner",
model="dslim/bert-base-NER",
aggregation_strategy="simple" # Merge B-/I- tokens
)
text = """
Microsoft CEO Satya Nadella announced a $10 billion investment in OpenAI.
The partnership was revealed at their Redmond, Washington headquarters.
"""
# Run NER
entities = ner_pipeline(text)
print("=== Transformer NER Results ===")
for entity in entities:
print(f"{entity['word']:20} | {entity['entity_group']:10} | {entity['score']:.4f}")
Output:
Microsoft | ORG | 0.9985
Satya Nadella | PER | 0.9991
OpenAI | ORG | 0.9972
Redmond | LOC | 0.9988
Washington | LOC | 0.9976
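The aggregation_strategy argument controls how subword predictions are merged. A quick way to see what it does is to compare the raw token-level output ("none") with the merged output ("simple"); this sketch assumes the same dslim/bert-base-NER model as above.
from transformers import pipeline

text = "Satya Nadella works at Microsoft in Redmond."

# Raw per-token predictions: one dict per subword, with an 'entity' key like B-PER / I-PER
raw_ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="none")
for token in raw_ner(text):
    print(token["word"], token["entity"], round(float(token["score"]), 3))

# Merged predictions: one dict per entity, with an 'entity_group' key like PER / ORG
merged_ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
for ent in merged_ner(text):
    print(ent["word"], ent["entity_group"], round(float(ent["score"]), 3))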
Comparing Different Models
from transformers import pipeline
models = [
"dslim/bert-base-NER",
"Jean-Baptiste/roberta-large-ner-english",
"xlm-roberta-large-finetuned-conll03-english",
]
text = "Google CEO Sundar Pichai visited Paris last Monday."
for model_name in models:
print(f"\n=== {model_name} ===")
ner = pipeline("ner", model=model_name, aggregation_strategy="simple")
results = ner(text)
for ent in results:
print(f" {ent['word']:15} → {ent['entity_group']:6} ({ent['score']:.3f})")
Low-Level Transformer Usage
For more control, use the model directly:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
model_name = "dslim/bert-base-NER"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
# Get label mapping
id2label = model.config.id2label
text = "Apple CEO Tim Cook announced new products in San Francisco."
# Tokenize
inputs = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")[0]
# Predict
with torch.no_grad():
outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)[0]
# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Token-level predictions:")
for token, pred_id, offset in zip(tokens, predictions, offset_mapping):
if token not in ["[CLS]", "[SEP]", "[PAD]"]:
label = id2label[pred_id.item()]
if label != "O":
print(f" {token:15} → {label}")
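The printout above is still per subword token. A minimal sketch of the merging step, roughly what aggregation_strategy="simple" does for you: start a new span on B- tags, extend it on matching I- tags, and use offset_mapping to recover character offsets in the original text.
entities = []
current = None
for pred_id, (start, end) in zip(predictions.tolist(), offset_mapping.tolist()):
    label = id2label[pred_id]
    if start == end:
        continue  # special tokens ([CLS], [SEP]) have empty offsets
    if label.startswith("B-"):
        if current:
            entities.append(current)
        current = {"label": label[2:], "start": start, "end": end}
    elif label.startswith("I-") and current and current["label"] == label[2:]:
        current["end"] = end  # extend the running entity
    else:
        if current:
            entities.append(current)
        current = None
if current:
    entities.append(current)

for ent in entities:
    print(text[ent["start"]:ent["end"]], "→", ent["label"])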
Training Custom NER Models
When to Train Custom Models
- Your entities aren't covered by pre-trained models (e.g., drug names, legal terms)
- You need higher accuracy for specific entity types
- You're working in a specialized domain
Training with spaCy
import spacy
from spacy.tokens import DocBin
from spacy.training import Example
import random
# Training data format: (text, {"entities": [(start, end, label), ...]})
TRAIN_DATA = [
("iPhone 15 Pro Max has a titanium frame",
{"entities": [(0, 17, "PRODUCT")]}),
("The M3 chip delivers amazing performance",
{"entities": [(4, 11, "HARDWARE")]}),
("macOS Sonoma includes new features",
{"entities": [(0, 12, "SOFTWARE")]}),
("Apple Vision Pro launches next month",
{"entities": [(0, 16, "PRODUCT")]}),
("Download Xcode from the App Store",
{"entities": [(9, 14, "SOFTWARE"), (24, 33, "PRODUCT")]}),
("The A17 Pro chip powers the new iPhone",
{"entities": [(4, 15, "HARDWARE"), (31, 37, "PRODUCT")]}),
("Install Python 3.11 for development",
{"entities": [(8, 19, "SOFTWARE")]}),
("VS Code is a popular editor",
{"entities": [(0, 7, "SOFTWARE")]}),
# Add more training examples (typically need 50-200+ per entity type)
]
def train_spacy_ner(train_data, output_dir="custom_ner", n_iter=30):
"""Train a custom spaCy NER model."""
# Create blank English model
nlp = spacy.blank("en")
# Add NER component
ner = nlp.add_pipe("ner")
# Add labels
for _, annotations in train_data:
for ent in annotations.get("entities"):
ner.add_label(ent[2])
print(f"Labels: {ner.labels}")
# Convert to Example objects
examples = []
for text, annotations in train_data:
doc = nlp.make_doc(text)
example = Example.from_dict(doc, annotations)
examples.append(example)
# Initialize the model
nlp.initialize(lambda: examples)
# Training loop
for itn in range(n_iter):
random.shuffle(examples)
losses = {}
for example in examples:
nlp.update([example], drop=0.35, losses=losses)
if itn % 10 == 0:
print(f"Iteration {itn}: Loss = {losses['ner']:.4f}")
# Save model
nlp.to_disk(output_dir)
print(f"\nModel saved to {output_dir}")
return nlp
# Train
nlp = train_spacy_ner(TRAIN_DATA)
# Test
test_texts = [
"The iPhone 15 Pro uses the A17 Pro chip",
"Update your macOS Sonoma to the latest version",
"Download VS Code and Python 3.12",
]
print("\n=== Testing Custom Model ===")
for text in test_texts:
doc = nlp(text)
print(f"\n'{text}'")
for ent in doc.ents:
print(f" {ent.text} → {ent.label_}")
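The DocBin import above is there because spaCy's recommended workflow for larger projects is config-driven CLI training (python -m spacy train config.cfg) rather than a hand-rolled loop. A minimal sketch of converting the same (text, annotations) pairs into the .spacy binary format that workflow expects (the output path is just an example):
from spacy.tokens import DocBin

def to_docbin(data, path):
    """Serialize (text, {"entities": [...]}) pairs to a .spacy file for `spacy train`."""
    nlp = spacy.blank("en")
    db = DocBin()
    for text, annotations in data:
        doc = nlp.make_doc(text)
        spans = []
        for start, end, label in annotations["entities"]:
            span = doc.char_span(start, end, label=label)
            if span is not None:  # skip spans that don't align to token boundaries
                spans.append(span)
        doc.ents = spans
        db.add(doc)
    db.to_disk(path)

to_docbin(TRAIN_DATA, "train.spacy")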
Training with Transformers (Fine-tuning BERT)
For higher accuracy, fine-tune a transformer model:
import torch
from transformers import (
AutoTokenizer,
AutoModelForTokenClassification,
TrainingArguments,
Trainer,
DataCollatorForTokenClassification,
)
from datasets import Dataset
import numpy as np
from seqeval.metrics import classification_report, f1_score
# Define labels (must include O for non-entities)
label_list = ["O", "B-PRODUCT", "I-PRODUCT", "B-HARDWARE", "I-HARDWARE", "B-SOFTWARE", "I-SOFTWARE"]
label2id = {label: i for i, label in enumerate(label_list)}
id2label = {i: label for i, label in enumerate(label_list)}
# Prepare training data
# Format: list of {"tokens": [...], "ner_tags": [...]}
train_data = [
{
"tokens": ["iPhone", "15", "Pro", "has", "a", "titanium", "frame"],
"ner_tags": [label2id["B-PRODUCT"], label2id["I-PRODUCT"], label2id["I-PRODUCT"],
label2id["O"], label2id["O"], label2id["O"], label2id["O"]]
},
{
"tokens": ["The", "M3", "chip", "is", "fast"],
"ner_tags": [label2id["O"], label2id["B-HARDWARE"], label2id["I-HARDWARE"],
label2id["O"], label2id["O"]]
},
{
"tokens": ["Install", "Python", "3.11", "now"],
"ner_tags": [label2id["O"], label2id["B-SOFTWARE"], label2id["I-SOFTWARE"], label2id["O"]]
},
# Add more examples (typically need 1000+)
]
# Create dataset
dataset = Dataset.from_list(train_data)
# Load tokenizer and model
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_and_align_labels(examples):
"""Tokenize and align labels with subword tokens."""
tokenized = tokenizer(
examples["tokens"],
truncation=True,
is_split_into_words=True,
max_length=128,
)
labels = []
for i, label in enumerate(examples["ner_tags"]):
word_ids = tokenized.word_ids(batch_index=i)
label_ids = []
previous_word_idx = None
for word_idx in word_ids:
if word_idx is None:
# Special tokens get -100 (ignored in loss)
label_ids.append(-100)
elif word_idx != previous_word_idx:
# First token of a word gets the label
label_ids.append(label[word_idx])
else:
# Other tokens of the same word get -100
label_ids.append(-100)
previous_word_idx = word_idx
labels.append(label_ids)
tokenized["labels"] = labels
return tokenized
# Tokenize dataset
tokenized_dataset = dataset.map(
tokenize_and_align_labels,
batched=True,
remove_columns=dataset.column_names
)
# Load model
model = AutoModelForTokenClassification.from_pretrained(
model_name,
num_labels=len(label_list),
id2label=id2label,
label2id=label2id,
)
# Training arguments
training_args = TrainingArguments(
output_dir="./ner_model",
num_train_epochs=10,
per_device_train_batch_size=8,
learning_rate=2e-5,
weight_decay=0.01,
logging_steps=10,
save_strategy="epoch",
report_to="none",
)
# Data collator
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
# Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset,
tokenizer=tokenizer,
data_collator=data_collator,
)
# Train (uncomment to run)
# trainer.train()
# Save model
# trainer.save_model("./ner_model_final")
print("Training configuration ready!")
print(f"Labels: {label_list}")
print(f"Number of training examples: {len(train_data)}")
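The seqeval imports above become useful once you attach a metrics function to the Trainer (pass compute_metrics=compute_metrics along with an eval_dataset). A standard implementation for token classification maps predictions back to label strings and skips the -100 positions:
def compute_metrics(eval_preds):
    """Entity-level metrics via seqeval; -100 positions (subwords, special tokens) are ignored."""
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=2)
    true_labels = [
        [id2label[l] for p, l in zip(pred, label) if l != -100]
        for pred, label in zip(predictions, labels)
    ]
    true_preds = [
        [id2label[p] for p, l in zip(pred, label) if l != -100]
        for pred, label in zip(predictions, labels)
    ]
    print(classification_report(true_labels, true_preds))  # per-type breakdown
    return {"f1": f1_score(true_labels, true_preds)}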
Evaluation Metrics
Understanding NER Metrics
NER evaluation uses entity-level metrics:
- Precision: Of predicted entities, what % are correct?
- Recall: Of actual entities, what % did we find?
- F1 Score: Harmonic mean of precision and recall
Strict vs Partial Matching
- Strict: Entity boundaries AND type must match exactly
- Partial: Overlapping entities with correct type count
from seqeval.metrics import classification_report, f1_score, precision_score, recall_score
# Ground truth and predictions (BIO format)
y_true = [
["O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG"],
["B-LOC", "O", "O", "B-PER", "O"],
]
y_pred = [
["O", "B-PER", "I-PER", "O", "B-ORG", "O"], # Missed I-ORG
["B-LOC", "O", "O", "O", "O"], # Missed B-PER
]
print("=== Classification Report ===")
print(classification_report(y_true, y_pred))
print(f"\nOverall F1: {f1_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
Evaluating spaCy Models
import spacy
from spacy.training import Example
from spacy.scorer import Scorer
def evaluate_ner(nlp, test_data):
    """Evaluate NER model on test data."""
    examples = []
    for text, annotations in test_data:
        # Run the full pipeline so the predicted doc actually contains entities;
        # Example.from_dict pairs that prediction with the gold annotations.
        pred_doc = nlp(text)
        example = Example.from_dict(pred_doc, annotations)
        examples.append(example)
    scorer = Scorer()
    scores = scorer.score(examples)
    return {
        "precision": scores["ents_p"],
        "recall": scores["ents_r"],
        "f1": scores["ents_f"],
        "per_type": scores["ents_per_type"]
    }
# Test data
TEST_DATA = [
("iPhone 15 Pro Max is available now", {"entities": [(0, 17, "PRODUCT")]}),
("The M3 chip is revolutionary", {"entities": [(4, 11, "HARDWARE")]}),
]
# Evaluate
nlp = spacy.load("custom_ner")
results = evaluate_ner(nlp, TEST_DATA)
print(f"F1 Score: {results['f1']:.4f}")
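The per-type breakdown is often more informative than the overall score when some labels have very few examples. Assuming the p/r/f keys spaCy's scorer uses for ents_per_type, you can print it like this:
for label, metrics in results["per_type"].items():
    print(f"{label:10} P={metrics['p']:.3f}  R={metrics['r']:.3f}  F={metrics['f']:.3f}")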
Common Challenges and Solutions
1. Nested Entities
Some entities contain other entities:
"Bank of America headquarters"
- "Bank of America" → ORG
- "America" → GPE (nested inside ORG)
Solution: Use spaCy's SpanCategorizer or specialized nested NER models.
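The underlying issue is that doc.ents cannot hold overlapping spans, which is exactly why span-categorization approaches store candidates in doc.spans instead, where overlap is allowed. A small illustration with hand-built spans (not a trained spancat model):
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Bank of America headquarters")

# doc.spans groups may overlap, unlike doc.ents
doc.spans["nested"] = [
    Span(doc, 0, 3, label="ORG"),   # "Bank of America"
    Span(doc, 2, 3, label="GPE"),   # "America", nested inside the ORG span
]
for span in doc.spans["nested"]:
    print(span.text, "→", span.label_)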
2. Ambiguous Entities
Context determines the entity type:
"Apple released new products" → Apple = ORG
"I ate an apple" → apple = not an entity
Solution: Transformer models handle context better than rule-based systems.
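You can see the effect of context directly; the small model will typically tag the first "Apple" as ORG and find no entity in the second sentence:
import spacy

nlp = spacy.load("en_core_web_sm")
for text in ["Apple released new products today.", "I ate an apple for lunch."]:
    doc = nlp(text)
    print(text, "→", [(ent.text, ent.label_) for ent in doc.ents])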
3. Long Documents
Most models have token limits (512 for BERT):
def process_long_document(text, nlp, max_length=2_000_000):
    """Process a long document with spaCy in one pass (raise the default 1M-character limit)."""
    nlp.max_length = max_length
    doc = nlp(text)
    return list(doc.ents)
def process_long_document_chunks(text, nlp, chunk_size=500, overlap=50):
    """Process with overlapping chunks (e.g. for 512-token transformer models)."""
    all_entities = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunk = text[start:end]
        doc = nlp(chunk)
        for ent in doc.ents:
            all_entities.append({
                "text": ent.text,
                "label": ent.label_,
                "start": start + ent.start_char,
                "end": start + ent.end_char
            })
        if end == len(text):
            break  # last chunk done; without this, start would never pass len(text)
        start = end - overlap
    # Entities in the overlap region can appear twice; deduplicate as shown below
    return all_entities
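A simple deduplication pass for the overlapping-chunk output: keep the first occurrence of each (start, end, label) triple, which is enough when chunks only disagree in the overlap region.
def deduplicate_entities(entities):
    """Drop exact duplicates produced by overlapping chunks."""
    seen = set()
    unique = []
    for ent in entities:
        key = (ent["start"], ent["end"], ent["label"])
        if key not in seen:
            seen.add(key)
            unique.append(ent)
    return unique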
4. Domain-Specific Entities
Pre-trained models don't know your domain:
# Medical entities require specialized models
from transformers import pipeline
# Use domain-specific model
bio_ner = pipeline("ner", model="d4data/biomedical-ner-all", aggregation_strategy="simple")
text = "The patient was prescribed 500mg Metformin for Type 2 Diabetes"
entities = bio_ner(text)
for ent in entities:
print(f"{ent['word']} → {ent['entity_group']}")
Production Deployment
Complete NER Service
import spacy
from typing import List, Dict
from dataclasses import dataclass
from functools import lru_cache
@dataclass
class Entity:
text: str
label: str
start: int
end: int
confidence: float = 1.0
class NERService:
"""Production-ready NER service."""
def __init__(self, model_name: str = "en_core_web_trf"):
self.nlp = spacy.load(model_name)
print(f"Loaded model: {model_name}")
print(f"Pipeline: {self.nlp.pipe_names}")
def extract_entities(self, text: str) -> List[Entity]:
"""Extract entities from text."""
doc = self.nlp(text)
return [
Entity(
text=ent.text,
label=ent.label_,
start=ent.start_char,
end=ent.end_char,
)
for ent in doc.ents
]
def extract_batch(self, texts: List[str], batch_size: int = 50) -> List[List[Entity]]:
"""Process multiple texts efficiently."""
results = []
for doc in self.nlp.pipe(texts, batch_size=batch_size):
entities = [
Entity(text=ent.text, label=ent.label_, start=ent.start_char, end=ent.end_char)
for ent in doc.ents
]
results.append(entities)
return results
def get_entities_by_type(self, text: str, entity_types: List[str]) -> List[Entity]:
"""Extract only specific entity types."""
all_entities = self.extract_entities(text)
return [ent for ent in all_entities if ent.label in entity_types]
# Usage
ner = NERService("en_core_web_sm")
# Single text
entities = ner.extract_entities("Apple CEO Tim Cook visited Paris")
for ent in entities:
print(f"{ent.text} → {ent.label}")
# Batch processing
texts = ["Microsoft is in Seattle", "Google is in Mountain View"]
batch_results = ner.extract_batch(texts)
for i, entities in enumerate(batch_results):
print(f"\nText {i+1}:")
for ent in entities:
print(f" {ent.text} → {ent.label}")
# Filter by type
people = ner.get_entities_by_type("Tim Cook met Satya Nadella", ["PERSON"])
print(f"\nPeople: {[p.text for p in people]}")
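To serve this over HTTP, you would typically wrap the service in a web framework. A minimal sketch with FastAPI (an assumption, not part of the original service; pip install fastapi uvicorn) that loads the model once at startup:
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
ner_service = NERService("en_core_web_sm")  # load once, reuse across requests

class NERRequest(BaseModel):
    text: str

@app.post("/entities")
def extract(request: NERRequest):
    entities = ner_service.extract_entities(request.text)
    return [
        {"text": e.text, "label": e.label, "start": e.start, "end": e.end}
        for e in entities
    ]

# Run with: uvicorn your_module:app --host 0.0.0.0 --port 8000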
Conclusion
NER is a well-solved problem for common entity types, but domain-specific applications require custom training:
- spaCy: Best for production - fast, easy to use, good accuracy
- Hugging Face Transformers: Best accuracy, especially for difficult cases
- Custom Training: Essential for domain-specific entities
- Evaluation: Always use proper entity-level metrics (seqeval)
Key takeaways:
- Start with pre-trained models (spaCy or HuggingFace)
- Add custom rules for known patterns (EntityRuler)
- Fine-tune when you need domain-specific entities
- Evaluate properly with entity-level F1 scores
References
- spaCy Documentation: https://spacy.io/usage/linguistic-features#named-entities
- Hugging Face NER: https://huggingface.co/tasks/token-classification
- Honnibal & Montani, "spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing" (2017)
- Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (2019)
- seqeval: https://github.com/chakki-works/seqeval