Prompt Engineering: Getting Better Results from LLMs

Jared Chung

Introduction

Prompt engineering is the practice of crafting inputs that reliably produce the outputs you need from Large Language Models. As LLMs become central to applications ranging from chatbots to code generation, the ability to communicate effectively with these models becomes a valuable skill.

The gap between a mediocre prompt and an excellent one can be the difference between:

  • Inconsistent results vs. reliable, reproducible outputs
  • Multiple retry attempts vs. first-time success
  • Unparseable text vs. structured data ready for your application

This guide covers the fundamental techniques that work across all major LLMs.

The Core Techniques

Understanding the Spectrum

Prompt engineering techniques exist on a spectrum from simple to complex:

| Technique | When to Use | Token Cost | Best For |
| --- | --- | --- | --- |
| Zero-Shot | Clear, simple tasks | Low | Classification, extraction |
| Few-Shot | Custom formats or categories | Medium | Domain-specific tasks |
| Chain-of-Thought | Multi-step reasoning | Higher | Math, logic, analysis |
| Structured Output | Application integration | Medium | APIs, data pipelines |

The key insight: start simple and add complexity only when needed. Zero-shot prompts work surprisingly well for many tasks, and you should only escalate to more sophisticated techniques when simpler approaches fail.

Zero-Shot Prompting

Zero-shot prompting asks the model to perform a task without providing any examples. The model relies entirely on its training to understand what you want.

When Zero-Shot Works Well

Zero-shot is effective when:

  1. The task is unambiguous (sentiment classification, summarization)
  2. The output format is natural (text, simple labels)
  3. The domain is general (not industry-specific jargon)

Anatomy of a Good Zero-Shot Prompt

A well-structured prompt includes:

[Role/Context] - Who is the AI?
[Task] - What should it do?
[Input] - What to process?
[Format] - How to structure output?
[Constraints] - What to avoid?

Example:

prompt = """You are a customer service classifier.

Classify the following customer message into exactly one category:
- billing
- technical
- shipping
- general

Customer message: "I was charged twice for my subscription"

Respond with only the category name, nothing else."""

The explicit constraint ("Respond with only the category name") prevents verbose explanations that make parsing difficult.
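
To see the prompt in action end to end, here is a minimal sketch of sending it through a chat completions API, assuming the openai Python SDK (v1+) and an API key in the environment; any chat-style client works the same way.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output is what we want for classification
)

category = response.choices[0].message.content.strip().lower()
print(category)  # expected: "billing"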

Common Zero-Shot Mistakes

  1. Too vague: "Summarize this" → Better: "Summarize in 3 bullet points under 15 words each"
  2. No format guidance: "Extract the dates" → Better: "Extract dates as ISO format: YYYY-MM-DD"
  3. Ambiguous scope: "Fix the code" → Better: "Fix the IndexError on line 15 and explain the cause"

Few-Shot Learning

When zero-shot produces inconsistent results, few-shot learning provides examples that demonstrate the desired behavior. The model learns the pattern from your examples and applies it to new inputs.

How Examples Guide the Model

Few-shot works through pattern recognition. The model identifies:

  • Input structure: What kind of data am I receiving?
  • Output format: How should I structure my response?
  • Decision logic: What reasoning connects input to output?

Example: Custom Classification

messages = [
    {"role": "system", "content": "Classify support tickets."},
    {"role": "user", "content": "My payment failed three times"},
    {"role": "assistant", "content": "billing"},
    {"role": "user", "content": "The app crashes on startup"},
    {"role": "assistant", "content": "technical"},
    {"role": "user", "content": "Package hasn't arrived in 2 weeks"},
    {"role": "assistant", "content": "shipping"},
    {"role": "user", "content": "I was charged in wrong currency"}
]
# Model learns pattern → outputs "billing"

How Many Examples?

| Task Complexity | Recommended Examples | Why |
| --- | --- | --- |
| Simple classification | 2-3 | Pattern is obvious |
| Custom categories | 3-5 | Need to show boundaries |
| Complex reasoning | 4-6 | Multiple steps to demonstrate |
| Creative/style | 1-2 | Showing tone, not logic |

More examples aren't always better—they consume tokens and can cause the model to overfit to specific patterns rather than generalizing.

Example Selection Matters

Choose examples that:

  1. Cover edge cases: Include borderline cases that define category boundaries
  2. Are diverse: Don't repeat similar examples
  3. Match expected inputs: Use realistic data similar to production
  4. Demonstrate the hard cases: Easy cases don't teach the model much
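
Keeping the examples in a plain data structure makes them easy to curate, version, and swap. A minimal sketch, where build_few_shot_messages is an illustrative helper rather than a library function:

# Hypothetical helper: turn (input, label) pairs into few-shot chat messages.
def build_few_shot_messages(system_prompt, examples, new_input):
    messages = [{"role": "system", "content": system_prompt}]
    for text, label in examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": new_input})
    return messages

examples = [
    ("My payment failed three times", "billing"),
    ("The app crashes on startup", "technical"),
    ("Package hasn't arrived in 2 weeks", "shipping"),
]

messages = build_few_shot_messages(
    "Classify support tickets.", examples, "I was charged in the wrong currency"
)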

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting encourages the model to reason step-by-step before providing an answer. This dramatically improves performance on tasks requiring:

  • Mathematical calculations
  • Logical reasoning
  • Multi-step analysis
  • Complex decision-making

Why Step-by-Step Reasoning Helps

LLMs generate text token by token. Without explicit reasoning:

  • The model might jump to a conclusion before considering all factors
  • Errors in early reasoning steps compound without correction
  • The model can't "backtrack" once tokens are generated

Chain-of-thought forces the model to show its work, which:

  1. Surfaces reasoning errors that can be caught
  2. Breaks complex problems into manageable steps
  3. Grounds the final answer in explicit logic

Zero-Shot CoT: The Magic Phrase

Simply adding "Let's think step by step" to your prompt triggers reasoning:

# Without CoT - often fails on complex math
prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?"""

# With CoT - much higher accuracy
prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?

Let's think step by step."""

The model will then work through:

  1. 15% of 80 = 12 items sold Monday
  2. 80 - 12 = 68 items remaining
  3. 20% of 68 = 13.6 ≈ 14 items sold Tuesday
  4. 68 - 14 = 54 items left

When to Use Chain-of-Thought

| Use CoT | Avoid CoT |
| --- | --- |
| Math problems | Simple classification |
| Logic puzzles | Direct extraction |
| Multi-step analysis | Summarization |
| Decisions with tradeoffs | Translation |
| Debugging/troubleshooting | Formatting tasks |

CoT adds latency and cost. For simple tasks, it's unnecessary overhead.
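
To keep CoT outputs parseable, one practical pattern (a sketch, not a feature of any particular API) is to ask for the reasoning followed by a clearly marked final line, then extract just that line:

cot_prompt = """If a store sells 15% of 80 items on Monday and 20% of the
remaining items on Tuesday, how many items are left?

Let's think step by step, then give the result on a final line
formatted exactly as: Final answer: <number>"""

def extract_final_answer(text):
    # Take the last line that starts with the agreed marker.
    for line in reversed(text.strip().splitlines()):
        if line.lower().startswith("final answer:"):
            return line.split(":", 1)[1].strip()
    return None

# answer = extract_final_answer(model_response_text)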

Structured Output

For applications that consume LLM outputs, unstructured text is problematic. Structured output techniques ensure responses follow a predictable format.

JSON Mode

Most modern APIs support JSON mode, which constrains the model to output valid JSON:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": """Extract entities from text. Return JSON:
{
    "people": ["name1", "name2"],
    "organizations": ["org1"],
    "locations": ["loc1"],
    "dates": ["YYYY-MM-DD"]
}"""
        },
        {"role": "user", "content": text}
    ],
    response_format={"type": "json_object"}
)
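
Even in JSON mode the content arrives as a string, so parse and sanity-check it before handing it to the rest of your application. A minimal sketch:

import json

raw = response.choices[0].message.content
try:
    entities = json.loads(raw)
except json.JSONDecodeError:
    entities = None  # fall back, retry, or log the raw output

if entities is not None:
    print(entities.get("people", []))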

Schema Enforcement with Pydantic

For type safety and validation, define schemas that the model must follow:

from pydantic import BaseModel
from typing import List

class ExtractedData(BaseModel):
    sentiment: str  # positive, negative, neutral
    confidence: float  # 0.0 to 1.0
    key_topics: List[str]
    summary: str

# Include schema in prompt
schema_json = ExtractedData.model_json_schema()

This gives you:

  • Automatic validation of LLM output
  • Type hints in your IDE
  • Clear documentation of expected format
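
Closing the loop looks roughly like this, assuming raw holds the model's JSON string (Pydantic v2 syntax):

from pydantic import ValidationError

try:
    data = ExtractedData.model_validate_json(raw)
    print(data.sentiment, data.confidence)
except ValidationError as err:
    # Schema violation: retry the request or route to a fallback handler.
    print("Invalid LLM output:", err)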

Best Practices for Structured Output

  1. Always provide the schema in the prompt - Don't assume the model knows your format
  2. Use simple types - Arrays, objects, strings, numbers work reliably
  3. Include example output - Show exactly what valid JSON looks like
  4. Validate before using - Even with JSON mode, validate against your schema

Advanced Techniques

Self-Consistency

For high-stakes decisions, generate multiple responses and aggregate:

  1. Run the same prompt 3-5 times with temperature > 0
  2. Extract the final answer from each response
  3. Take the majority vote

This reduces single-run errors and provides a confidence signal (how often did answers agree?).
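
A minimal sketch of the voting logic, assuming a classify helper that sends one prompt and returns one parsed answer (the helper name is illustrative):

from collections import Counter

def self_consistent_answer(classify, prompt, runs=5):
    # classify(prompt) is assumed to return a single parsed answer string.
    answers = [classify(prompt) for _ in range(runs)]
    answer, count = Counter(answers).most_common(1)[0]
    confidence = count / runs  # agreement rate as a rough confidence signal
    return answer, confidence

Some APIs can also return several completions from a single request (for example, the OpenAI n parameter), which saves round trips.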

Prompt Chaining

Complex tasks often work better as a sequence of simpler prompts:

Task: Research report on a topic

Chain:
1. Generate research questions → questions
2. Answer each question → raw_findings
3. Identify themes → themes
4. Synthesize into report → final_report

Each step has a focused task, making debugging easier and quality higher.
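
A sketch of that chain using a thin wrapper around a single chat completion call (the wrapper and the intermediate prompts are illustrative):

from openai import OpenAI

client = OpenAI()

def run_prompt(instruction, context=""):
    # One focused call per step; context carries the previous step's output.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"{instruction}\n\n{context}".strip()}],
    )
    return response.choices[0].message.content

topic = "vector databases"
questions = run_prompt(f"List 5 research questions about {topic}.")
raw_findings = run_prompt("Answer each of these questions in 2-3 sentences.", context=questions)
themes = run_prompt("Identify the 3 main themes across these findings.", context=raw_findings)
final_report = run_prompt("Write a short report organized by these themes.", context=themes)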

Role Prompting

Assigning a specific persona influences the model's vocabulary, depth, and perspective:

personas = {
    "expert": "You are a senior software architect with 20 years experience.",
    "beginner": "You are explaining to someone new to programming.",
    "skeptic": "You are a critical reviewer looking for flaws.",
}

Role prompting is especially useful for:

  • Technical depth (expert roles)
  • Accessibility (teacher roles)
  • Quality assurance (reviewer roles)
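
In the chat format, the persona is simply the system message. A minimal sketch that reuses the personas dict above and assumes an OpenAI-style client is already configured:

def ask_with_persona(question, persona_key):
    # Assumes the personas dict and an OpenAI-style `client` are in scope.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": personas[persona_key]},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

review = ask_with_persona("What could go wrong with caching user sessions in memory?", "skeptic")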

Common Patterns

The RISEN Framework

A structured approach to prompt construction:

  • Role: Who is the AI?
  • Instructions: What should it do?
  • Situation: What's the context?
  • Examples: Demonstrations of desired behavior
  • Narrowing: Constraints and format
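
As a sketch, RISEN maps naturally onto a reusable template string; the role, metrics, and wording below are illustrative:

risen_prompt = """Role: You are a senior data analyst.
Instructions: Summarize the key trends in the metrics below.
Situation: These are weekly active user counts for a mobile app.
Examples: A good summary looks like: "WAU grew 12% week over week, driven by the new onboarding flow."
Narrowing: Use at most 3 bullet points and no technical jargon.

Metrics:
{metrics}"""

prompt = risen_prompt.format(metrics="Week 1: 10,400\nWeek 2: 11,650\nWeek 3: 12,020")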

Temperature and Sampling

| Use Case | Temperature | Why |
| --- | --- | --- |
| Code generation | 0.0-0.2 | Determinism matters |
| Factual Q&A | 0.0-0.3 | Accuracy over creativity |
| Creative writing | 0.7-1.0 | Variety and novelty |
| Brainstorming | 0.8-1.2 | Maximum divergence |

For reproducible results, use temperature=0 and set a seed parameter if available.
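
A minimal sketch with an OpenAI-style client; the seed parameter is honored on a best-effort basis by some models and providers, so treat determinism as a goal rather than a guarantee:

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize HTTP caching in one sentence."}],
    temperature=0,   # minimize sampling randomness
    seed=42,         # best-effort reproducibility where supported
)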

Negative Constraints

Telling the model what NOT to do is often as important as what to do:

constraints = """
- Do NOT include code examples
- Do NOT use bullet points
- Do NOT exceed 100 words
- Do NOT use technical jargon
"""

Negative constraints prevent common failure modes and keep outputs focused.

Testing and Iteration

Treat Prompts Like Code

Good prompt engineering practices:

  1. Version control - Track prompt changes over time
  2. Test suites - Define expected outputs for given inputs
  3. Regression testing - Ensure changes don't break existing cases
  4. A/B testing - Compare prompt variations on real data
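
A minimal sketch of a regression-style test suite, assuming a classify_ticket function that wraps your current prompt (the function name and cases are illustrative):

TEST_CASES = [
    ("I was charged twice for my subscription", "billing"),
    ("The app crashes on startup", "technical"),
    ("Where is my package?", "shipping"),
]

def run_prompt_tests(classify_ticket):
    failures = []
    for text, expected in TEST_CASES:
        actual = classify_ticket(text)
        if actual != expected:
            failures.append((text, expected, actual))
    accuracy = 1 - len(failures) / len(TEST_CASES)
    return accuracy, failures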

Evaluation Metrics

How to measure prompt quality:

| Metric | Measures | How to Calculate |
| --- | --- | --- |
| Accuracy | Correctness | % matching expected output |
| Consistency | Reliability | Variance across multiple runs |
| Format compliance | Parseability | % valid JSON/schema matches |
| Latency | Speed | Response time in ms |
| Cost | Efficiency | Tokens used per request |
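
Accuracy falls out of a test suite like the one above; format compliance and consistency are just as mechanical to compute. A sketch, assuming outputs is a list of raw responses to the same prompt:

import json
from collections import Counter

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def format_compliance(outputs):
    # Fraction of responses that parse as valid JSON.
    return sum(is_valid_json(o) for o in outputs) / len(outputs)

def consistency(outputs):
    # Fraction of runs that agree with the most common output.
    top_count = Counter(outputs).most_common(1)[0][1]
    return top_count / len(outputs)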

Iterative Improvement

When a prompt isn't working:

  1. Identify failure mode - What specifically is wrong?
  2. Add constraints - Explicitly forbid the bad behavior
  3. Add examples - Show the correct behavior
  4. Simplify - Maybe the task is too complex for one prompt
  5. Escalate - Try a more capable model

Conclusion

Effective prompt engineering comes down to clear communication:

  1. Be specific - Vague prompts produce vague results
  2. Start simple - Only add complexity when needed
  3. Show examples - Few-shot learning is surprisingly powerful
  4. Request structure - JSON mode enables reliable parsing
  5. Encourage reasoning - Chain-of-thought improves accuracy on hard problems
  6. Test systematically - Treat prompts as code that needs testing

The best prompts evolve through experimentation. Start with a simple approach, measure what's working, and iterate.
