Prompt Engineering: Getting Better Results from LLMs

Jared Chung

Introduction

Prompt engineering is the practice of crafting inputs that reliably produce the outputs you need from Large Language Models. As LLMs become central to applications ranging from chatbots to code generation, the ability to communicate effectively with these models becomes a valuable skill.

The gap between a mediocre prompt and an excellent one can be the difference between:

  • Inconsistent results vs. reliable, reproducible outputs
  • Multiple retry attempts vs. first-time success
  • Unparseable text vs. structured data ready for your application

This guide covers the fundamental techniques that work across all major LLMs.

The Core Techniques

Understanding the Spectrum

Prompt engineering techniques exist on a spectrum from simple to complex:

| Technique | When to Use | Token Cost | Best For |
| --- | --- | --- | --- |
| Zero-Shot | Clear, simple tasks | Low | Classification, extraction |
| Few-Shot | Custom formats or categories | Medium | Domain-specific tasks |
| Chain-of-Thought | Multi-step reasoning | Higher | Math, logic, analysis |
| Structured Output | Application integration | Medium | APIs, data pipelines |

The key insight: start simple and add complexity only when needed. Zero-shot prompts work surprisingly well for many tasks, and you should only escalate to more sophisticated techniques when simpler approaches fail.

Zero-Shot Prompting

Zero-shot prompting asks the model to perform a task without providing any examples. The model relies entirely on its training to understand what you want.

When Zero-Shot Works Well

Zero-shot is effective when:

  1. The task is unambiguous (sentiment classification, summarization)
  2. The output format is natural (text, simple labels)
  3. The domain is general (not industry-specific jargon)

Anatomy of a Good Zero-Shot Prompt

A well-structured prompt includes:

[Role/Context] - Who is the AI?
[Task] - What should it do?
[Input] - What to process?
[Format] - How to structure output?
[Constraints] - What to avoid?

Example:

prompt = """You are a customer service classifier.

Classify the following customer message into exactly one category:
- billing
- technical
- shipping
- general

Customer message: "I was charged twice for my subscription"

Respond with only the category name, nothing else."""

The explicit constraint ("Respond with only the category name") prevents verbose explanations that make parsing difficult.
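
To see the prompt in action end to end, here is a minimal sketch of sending it through a chat completions API, assuming the openai Python SDK (v1+) and an API key in the environment; any chat-style client works the same way.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output is what we want for classification
)

category = response.choices[0].message.content.strip().lower()
print(category)  # expected: "billing"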

Common Zero-Shot Mistakes

  1. Too vague: "Summarize this" → Better: "Summarize in 3 bullet points under 15 words each"
  2. No format guidance: "Extract the dates" → Better: "Extract dates as ISO format: YYYY-MM-DD"
  3. Ambiguous scope: "Fix the code" → Better: "Fix the IndexError on line 15 and explain the cause"

Few-Shot Learning

When zero-shot produces inconsistent results, few-shot learning provides examples that demonstrate the desired behavior. The model learns the pattern from your examples and applies it to new inputs.

How Examples Guide the Model

Few-shot works through pattern recognition. The model identifies:

  • Input structure: What kind of data am I receiving?
  • Output format: How should I structure my response?
  • Decision logic: What reasoning connects input to output?

Example: Custom Classification

messages = [
    {"role": "system", "content": "Classify support tickets."},
    {"role": "user", "content": "My payment failed three times"},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The app crashes on startup"},
    {"role": "assistant", "content": "technical"},
    {"role": "user", "content": "Package hasn't arrived in 2 weeks"},
    {"role": "assistant", "content": "shipping"},
    {"role": "user", "content": "I was charged in wrong currency"}
]
# Model learns pattern → outputs "billing"

How Many Examples?

| Task Complexity | Recommended Examples | Why |
| --- | --- | --- |
| Simple classification | 2-3 | Pattern is obvious |
| Custom categories | 3-5 | Need to show boundaries |
| Complex reasoning | 4-6 | Multiple steps to demonstrate |
| Creative/style | 1-2 | Showing tone, not logic |

More examples aren't always better—they consume tokens and can cause the model to overfit to specific patterns rather than generalizing.

Example Selection Matters

Choose examples that:

  1. Cover edge cases: Include borderline cases that define category boundaries
  2. Are diverse: Don't repeat similar examples
  3. Match expected inputs: Use realistic data similar to production
  4. Demonstrate the hard cases: Easy cases don't teach the model much
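
Keeping the examples in a plain data structure makes them easy to curate, version, and swap. A minimal sketch, where build_few_shot_messages is an illustrative helper rather than a library function:

# Hypothetical helper: turn (input, label) pairs into few-shot chat messages.
def build_few_shot_messages(system_prompt, examples, new_input):
    messages = [{"role": "system", "content": system_prompt}]
    for text, label in examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": new_input})
    return messages

examples = [
    ("My payment failed three times", "billing"),
    ("The app crashes on startup", "technical"),
    ("Package hasn't arrived in 2 weeks", "shipping"),
]

messages = build_few_shot_messages(
    "Classify support tickets.", examples, "I was charged in the wrong currency"
)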

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting encourages the model to reason step-by-step before providing an answer. This dramatically improves performance on tasks requiring:

  • Mathematical calculations
  • Logical reasoning
  • Multi-step analysis
  • Complex decision-making

Why Step-by-Step Reasoning Helps

LLMs generate text token by token. Without explicit reasoning:

  • The model might jump to a conclusion before considering all factors
  • Errors in early reasoning steps compound without correction
  • The model can't "backtrack" once tokens are generated

Chain-of-thought forces the model to show its work, which:

  1. Surfaces reasoning errors that can be caught
  2. Breaks complex problems into manageable steps
  3. Grounds the final answer in explicit logic

Zero-Shot CoT: The Magic Phrase

Simply adding "Let's think step by step" to your prompt triggers reasoning:

# Without CoT - often fails on complex math
prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?"""

# With CoT - much higher accuracy
prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?

Let's think step by step."""

The model will then work through:

  1. 15% of 80 = 12 items sold Monday
  2. 80 - 12 = 68 items remaining
  3. 20% of 68 = 13.6 ≈ 14 items sold Tuesday
  4. 68 - 14 = 54 items left

When to Use Chain-of-Thought

| Use CoT | Avoid CoT |
| --- | --- |
| Math problems | Simple classification |
| Logic puzzles | Direct extraction |
| Multi-step analysis | Summarization |
| Decisions with tradeoffs | Translation |
| Debugging/troubleshooting | Formatting tasks |

CoT adds latency and cost. For simple tasks, it's unnecessary overhead.
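
To keep CoT outputs parseable, one practical pattern (a sketch, not a feature of any particular API) is to ask for the reasoning followed by a clearly marked final line, then extract just that line:

cot_prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?

Let's think step by step, then give the result on a final line
formatted exactly as: Final answer: <number>"""

def extract_final_answer(text):
    # Take the last line that starts with the agreed marker.
    for line in reversed(text.strip().splitlines()):
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return None

# answer = extract_final_answer(model_response_text)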

Structured Output

For applications that consume LLM outputs, unstructured text is problematic. Structured output techniques ensure responses follow a predictable format.

JSON Mode

Most modern APIs support JSON mode, which constrains the model to output valid JSON:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": """Extract entities from text. Return JSON:
{
    "people": ["name1", "name2"],
    "organizations": ["org1"],
    "locations": ["loc1"],
    "dates": ["YYYY-MM-DD"]
}"""
        },
        {"role": "user", "content": text}
    ],
    response_format={"type": "json_object"}
)
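
Even in JSON mode the content arrives as a string, so parse and sanity-check it before handing it to the rest of your application. A minimal sketch:

import json

raw = response.choices[0].message.content
try:
    entities = json.loads(raw)
except json.JSONDecodeError:
    entities = None  # fall back, retry, or log the raw output

if entities is not None:
    print(entities.get("people", []))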

Schema Enforcement with Pydantic

For type safety and validation, define schemas that the model must follow:

from pydantic import BaseModel
from typing import List

class ExtractedData(BaseModel):
    sentiment: str  # positive, negative, neutral
    confidence: float  # 0.0 to 1.0
    key_topics: List[str]
    summary: str

# Include schema in prompt
schema_json = ExtractedData.model_json_schema()

This gives you:

  • Automatic validation of LLM output
  • Type hints in your IDE
  • Clear documentation of expected format
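
Closing the loop looks roughly like this, assuming raw holds the model's JSON string (Pydantic v2 syntax):

from pydantic import ValidationError

try:
    data = ExtractedData.model_validate_json(raw)
    print(data.sentiment, data.confidence)
except ValidationError as err:
    # Schema violation: retry the request or route to a fallback handler.
    print("Invalid LLM output:", err)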

Best Practices for Structured Output

  1. Always provide the schema in the prompt - Don't assume the model knows your format
  2. Use simple types - Arrays, objects, strings, numbers work reliably
  3. Include example output - Show exactly what valid JSON looks like
  4. Validate before using - Even with JSON mode, validate against your schema

Advanced Techniques

Self-Consistency

For high-stakes decisions, generate multiple responses and aggregate:

  1. Run the same prompt 3-5 times with temperature > 0
  2. Extract the final answer from each response
  3. Take the majority vote

This reduces single-run errors and provides a confidence signal (how often did answers agree?).
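
A minimal sketch of the voting logic, assuming a classify helper that sends one prompt and returns one parsed answer (the helper name is illustrative):

from collections import Counter

def self_consistent_answer(classify, prompt, runs=5):
    # classify(prompt) is assumed to return a single parsed answer string.
    answers = [classify(prompt) for _ in range(runs)]
    answer, count = Counter(answers).most_common(1)[0]
    confidence = count / runs  # agreement rate as a rough confidence signal
    return answer, confidence

Some APIs can also return several completions from a single request (for example, the OpenAI n parameter), which saves round trips.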

Prompt Chaining

Complex tasks often work better as a sequence of simpler prompts:

Task: Research report on a topic

Chain:
1. Generate research questions → questions
2. Answer each question → raw_findings
3. Identify themes → themes
4. Synthesize into report → final_report

Each step has a focused task, making debugging easier and quality higher.
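
A sketch of that chain using a thin wrapper around a single chat completion call (the wrapper and the intermediate prompts are illustrative):

from openai import OpenAI

client = OpenAI()

def run_prompt(instruction, context=""):
    # One focused call per step; context carries the previous step's output.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{instruction}\n\n{context}".strip()}],
    )
    return response.choices[0].message.content

topic = "vector databases"
questions = run_prompt(f"List 5 research questions about {topic}.")
raw_findings = run_prompt("Answer each of these questions in 2-3 sentences.", context=questions)
themes = run_prompt("Identify the 3 main themes across these findings.", context=raw_findings)
final_report = run_prompt("Write a short report organized by these themes.", context=themes)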

Role Prompting

Assigning a specific persona influences the model's vocabulary, depth, and perspective:

personas = {
    "expert": "You are a senior software architect with 20 years experience.",
    "beginner": "You are explaining to someone new to programming.",
    "skeptic": "You are a critical reviewer looking for flaws.",
}

Role prompting is especially useful for:

  • Technical depth (expert roles)
  • Accessibility (teacher roles)
  • Quality assurance (reviewer roles)
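
In the chat format, the persona is simply the system message. A minimal sketch that reuses the personas dict above and assumes an OpenAI-style client is already configured:

def ask_with_persona(question, persona_key):
    # Assumes the personas dict and an OpenAI-style `client` are in scope.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": personas[persona_key]},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

review = ask_with_persona("What could go wrong with caching user sessions in memory?", "skeptic")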

Common Patterns

The RISEN Framework

A structured approach to prompt construction:

  • Role: Who is the AI?
  • Instructions: What should it do?
  • Situation: What's the context?
  • Examples: Demonstrations of desired behavior
  • Narrowing: Constraints and format
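
As a sketch, RISEN maps naturally onto a reusable template string; the role, metrics, and wording below are illustrative:

risen_prompt = """Role: You are a senior data analyst.
Instructions: Summarize the key trends in the metrics below.
Situation: These are weekly active user counts for a mobile app.
Examples: A good summary looks like: "WAU grew 12% week over week, driven by the new onboarding flow."
Narrowing: Use at most 3 bullet points and no technical jargon.

Metrics:
{metrics}"""

prompt = risen_prompt.format(metrics="Week 1: 10,400\nWeek 2: 11,650\nWeek 3: 12,020")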

Temperature and Sampling

| Use Case | Temperature | Why |
| --- | --- | --- |
| Code generation | 0.0-0.2 | Determinism matters |
| Factual Q&A | 0.0-0.3 | Accuracy over creativity |
| Creative writing | 0.7-1.0 | Variety and novelty |
| Brainstorming | 0.8-1.2 | Maximum divergence |

For reproducible results, use temperature=0 and set a seed parameter if available.
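
A minimal sketch with an OpenAI-style client; the seed parameter is honored on a best-effort basis by some models and providers, so treat determinism as a goal rather than a guarantee:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize HTTP caching in one sentence."}],
    temperature=0,   # minimize sampling randomness
    seed=42,         # best-effort reproducibility where supported
)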

Negative Constraints

Telling the model what NOT to do is often as important as what to do:

constraints = """
- Do NOT include code examples
- Do NOT use bullet points
- Do NOT exceed 100 words
- Do NOT use technical jargon
"""

Negative constraints prevent common failure modes and keep outputs focused.

Testing and Iteration

Treat Prompts Like Code

Good prompt engineering practices:

  1. Version control - Track prompt changes over time
  2. Test suites - Define expected outputs for given inputs
  3. Regression testing - Ensure changes don't break existing cases
  4. A/B testing - Compare prompt variations on real data
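
A minimal sketch of a regression-style test suite, assuming a classify_ticket function that wraps your current prompt (the function name and cases are illustrative):

TEST_CASES = [
    ("I was charged twice for my subscription", "billing"),
    ("The app crashes on startup", "technical"),
    ("Where is my package?", "shipping"),
]

def run_prompt_tests(classify_ticket):
    failures = []
    for text, expected in TEST_CASES:
        actual = classify_ticket(text)
        if actual != expected:
            failures.append((text, expected, actual))
    accuracy = 1 - len(failures) / len(TEST_CASES)
    return accuracy, failures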

Evaluation Metrics

How to measure prompt quality:

| Metric | Measures | How to Calculate |
| --- | --- | --- |
| Accuracy | Correctness | % matching expected output |
| Consistency | Reliability | Variance across multiple runs |
| Format compliance | Parseability | % valid JSON/schema matches |
| Latency | Speed | Response time in ms |
| Cost | Efficiency | Tokens used per request |
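
Accuracy falls out of a test suite like the one above; format compliance and consistency are just as mechanical to compute. A sketch, assuming outputs is a list of raw responses to the same prompt:

import json
from collections import Counter

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def format_compliance(outputs):
    # Fraction of responses that parse as valid JSON.
    return sum(is_valid_json(o) for o in outputs) / len(outputs)

def consistency(outputs):
    # Fraction of runs that agree with the most common output.
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)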

Iterative Improvement

When a prompt isn't working:

  1. Identify failure mode - What specifically is wrong?
  2. Add constraints - Explicitly forbid the bad behavior
  3. Add examples - Show the correct behavior
  4. Simplify - Maybe the task is too complex for one prompt
  5. Escalate - Try a more capable model

Conclusion

Effective prompt engineering comes down to clear communication:

  1. Be specific - Vague prompts produce vague results
  2. Start simple - Only add complexity when needed
  3. Show examples - Few-shot learning is surprisingly powerful
  4. Request structure - JSON mode enables reliable parsing
  5. Encourage reasoning - Chain-of-thought improves accuracy on hard problems
  6. Test systematically - Treat prompts as code that needs testing

The best prompts evolve through experimentation. Start with a simple approach, measure what's working, and iterate.
