LoRA and QLoRA: Efficient LLM Fine-tuning on Consumer Hardware

Jared Chung
Introduction

Fine-tuning a 7-billion parameter model requires approximately 70GB of GPU memory—far beyond consumer hardware. Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA solve this by training only a small fraction of parameters while keeping the rest frozen.

The key insight: the weight updates learned during fine-tuning have low intrinsic rank. Instead of updating a 768x768 matrix (about 590,000 parameters), we can learn two smaller matrices (768x16 and 16x768) that approximate the update with roughly 24,600 parameters, a 24x reduction.

This guide explains how LoRA works, when to use it, and how to configure it effectively.

The Memory Problem

Why Full Fine-tuning is Expensive

Training a neural network requires storing multiple components in memory:

Component        | Description                        | Memory (7B model, FP16)
-----------------|------------------------------------|------------------------
Model weights    | The parameters themselves          | ~14 GB
Gradients        | Derivatives for each parameter     | ~14 GB
Optimizer states | Adam momentum and variance         | ~28 GB
Activations      | Intermediate values for backprop   | ~10+ GB
Total            |                                    | ~70 GB

A 7B model in FP16 is 14GB, but training requires 5x that for gradients and optimizer states.
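A quick back-of-the-envelope check of these numbers (a sketch; activation memory in particular varies with batch size and sequence length):

# Rough full fine-tuning memory for a 7B-parameter model in FP16 with Adam.
params = 7e9

weights_gb   = params * 2 / 1e9   # 2 bytes per FP16 weight          -> ~14 GB
grads_gb     = params * 2 / 1e9   # one FP16 gradient per weight     -> ~14 GB
optimizer_gb = params * 4 / 1e9   # Adam momentum + variance in FP16 -> ~28 GB
activations_gb = 10               # placeholder; depends on batch size and sequence length

print(weights_gb + grads_gb + optimizer_gb + activations_gb)   # ~66 GB, on the order of 70 GB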

The PEFT Solution

PEFT methods reduce memory by training only a subset of parameters:

Method           | Approach                  | Trainable %
-----------------|---------------------------|------------
Full fine-tuning | Update all parameters     | 100%
LoRA             | Add low-rank adapters     | 0.1-1%
QLoRA            | LoRA + 4-bit base model   | 0.1-1%
Prefix tuning    | Learn prompt embeddings   | <0.1%

LoRA is the most popular because it balances efficiency with expressiveness.

How LoRA Works

[Figure: LoRA architecture]

The Low-Rank Hypothesis

The core assumption: when fine-tuning for a specific task, the weight changes have low intrinsic dimensionality. A full matrix update ΔW can be approximated by the product of two smaller matrices:

ΔW ≈ A × B

Where:

  • W is d×d (e.g., 768×768 ≈ 590,000 parameters)
  • A is d×r (e.g., 768×16 = 12,288 parameters)
  • B is r×d (e.g., 16×768 = 12,288 parameters)
  • Total: 24,576 trainable parameters, a 24x reduction
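A quick sanity check of these counts in Python:

# Dense update vs. rank-16 factorization for a 768 x 768 weight matrix.
d, r = 768, 16

full_update = d * d          # 589,824 parameters in the full delta-W
lora_update = d * r + r * d  # 24,576 parameters in A (d x r) plus B (r x d)

print(full_update, lora_update, full_update / lora_update)   # 589824 24576 24.0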

The Forward Pass

During inference, the output combines the frozen weights with the learned adaptation:

h = Wx + (A × B)x × (α/r)

Where:

  • W: Original frozen weights
  • A × B: Low-rank update (trainable)
  • α/r: Scaling factor (alpha divided by rank)

Key Insight: Zero Initialization

LoRA initializes:

  • A: Random values (Kaiming initialization)
  • B: All zeros

This means initially ΔW = A × B = 0, so the model starts with the exact pre-trained behavior. Training gradually learns the adaptation from this stable starting point.
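The pieces above fit together in a few lines. Here is a minimal, illustrative LoRA layer in PyTorch (a sketch of the idea, not the PEFT library's implementation; the name LoRALinear is made up for this example):

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: h = Wx + (alpha/r) * (A x B)x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze W (and bias)
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.empty(base.in_features, r))
        self.B = nn.Parameter(torch.zeros(r, base.out_features))
        nn.init.kaiming_uniform_(self.A)          # A: random (Kaiming) init
        # B starts at zero, so A @ B = 0 and the layer initially matches the base model
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scaling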

Key Parameters

Rank (r)

Controls the capacity of the adaptation—how much change the model can learn.

Rank | Parameters per Layer | Use Case
-----|----------------------|------------------------------
4    | Very few             | Simple tasks, small datasets
8    | Low                  | Basic instruction following
16   | Moderate             | Good default for most tasks
32   | Higher               | Complex domain adaptation
64+  | Many                 | Very complex tasks

Guidance:

  • Start with r=16
  • Increase if model underfits (can't learn the task)
  • Decrease if model overfits (memorizes training data)

Alpha (α)

Controls the strength of the adaptation through the scaling factor α/r.

α/r Ratio | Effect
----------|------------------------
0.5       | Conservative adaptation
1.0       | Balanced
2.0       | Common choice
4.0       | Strong adaptation

Common patterns:

  • α = r: Scaling factor = 1 (balanced)
  • α = 2r: Scaling factor = 2 (common, stronger adaptation)

Target Modules

Which layers to apply LoRA to:

Strategy       | Modules                          | Trade-off
---------------|----------------------------------|--------------------------
Attention only | q_proj, v_proj                   | Minimal parameters, fast
Full attention | q_proj, k_proj, v_proj, o_proj   | Better attention control
All linear     | + gate_proj, up_proj, down_proj  | Maximum capacity

Recommendation: Start with full attention (q,k,v,o projections). Add feedforward layers if needed.
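Module names differ between architectures; the names above are the Llama-style ones. One way to list the linear layers a given model actually exposes (a sketch; the model ID is just an example):

import torch.nn as nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Collect the distinct short names of every nn.Linear submodule
# (for Llama-2 this includes q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj).
linear_names = {name.split(".")[-1]
                for name, module in model.named_modules()
                if isinstance(module, nn.Linear)}
print(sorted(linear_names))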

QLoRA: Extreme Efficiency

QLoRA combines LoRA with 4-bit quantization:

Component         | Precision    | Memory
------------------|--------------|----------------
Base model        | 4-bit (NF4)  | ~3.5 GB for 7B
LoRA adapters     | 16-bit       | ~50 MB
Training overhead | 16-bit       | Minimal

Result: fine-tune a 30B-class model on a single 24GB GPU, and 65-70B models on a single 48GB GPU.

NF4: The Secret Sauce

Normal Float 4-bit (NF4) is optimized for neural network weights that follow a normal distribution:

  • Regular 4-bit: 16 evenly spaced quantization levels
  • NF4: 16 levels optimized for normal distributions

NF4 has significantly lower quantization error for typical LLM weights.
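To see the difference, compare 16 evenly spaced levels with 16 levels placed at quantiles of a standard normal distribution (an illustrative sketch of the idea, not the exact NF4 codebook):

import torch

# Ordinary 4-bit quantization: 16 evenly spaced levels over [-1, 1].
uniform_levels = torch.linspace(-1, 1, 16)

# NF4-style levels: equally spaced quantiles of a standard normal, rescaled to [-1, 1].
quantiles = torch.linspace(0.02, 0.98, 16)
normal_levels = torch.distributions.Normal(0.0, 1.0).icdf(quantiles)
normal_levels = normal_levels / normal_levels.abs().max()

print(uniform_levels)
print(normal_levels)   # denser near zero, where most normally distributed weights lie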

Double Quantization

QLoRA further reduces memory by quantizing the quantization constants themselves, saving roughly 0.4 bits per parameter on average.

Practical Implementation

Using PEFT Library

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Configure LoRA
lora_config = LoraConfig(
    r=16,                           # Rank
    lora_alpha=32,                  # Alpha (scaling = 32/16 = 2)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (approximate, for Llama-2-7B with this config):
# trainable params: ~16.8M || all params: ~6.74B || trainable%: ~0.25

QLoRA Setup

from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training
import torch

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True        # Double quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare the quantized model for k-bit training, then apply LoRA
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)

Training Configuration

LoRA requires different hyperparameters than full fine-tuning:

Parameter     | Full Fine-tuning | LoRA
--------------|------------------|---------------
Learning rate | 1e-5 to 5e-5     | 1e-4 to 5e-4
Weight decay  | 0.01             | 0.001
Warmup        | 5-10%            | 3%

LoRA parameters start at zero, so they need higher learning rates to learn effectively.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./lora_output",
    learning_rate=2e-4,          # Higher than full fine-tuning
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    warmup_ratio=0.03,
    weight_decay=0.001,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
)
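Wiring the pieces together (a sketch; train_dataset is a placeholder for your own tokenized dataset):

from transformers import Trainer, DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,                      # the PEFT-wrapped model from above
    args=training_args,
    train_dataset=train_dataset,      # assumed: a tokenized causal-LM dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
model.save_pretrained("./lora_output/adapter")   # saves only the adapter weights (tens of MB)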

Multi-Task Adapters

A powerful LoRA feature: train multiple adapters for different tasks using the same base model.

# Train separate adapters
# ./adapters/creative_writing/  (r=32, high capacity)
# ./adapters/code_generation/   (r=16, balanced)
# ./adapters/summarization/     (r=8, focused)

# Load and switch at inference
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("base_model")

# For creative writing
creative_model = PeftModel.from_pretrained(base_model, "./adapters/creative_writing")

# For code generation
code_model = PeftModel.from_pretrained(base_model, "./adapters/code_generation")

Benefits:

  • Single base model in memory
  • Swap adapters in milliseconds
  • Each adapter is tiny (~50-100MB)
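Recent PEFT versions also let you attach several adapters to one model and switch between them by name, rather than building a separate PeftModel per task (a sketch using the hypothetical adapter paths above):

# Attach multiple adapters to a single base model and switch between them.
model = PeftModel.from_pretrained(base_model, "./adapters/creative_writing",
                                  adapter_name="creative")
model.load_adapter("./adapters/code_generation", adapter_name="code")

model.set_adapter("creative")   # subsequent generation uses the creative-writing adapter
model.set_adapter("code")       # switch to the code-generation adapter in milliseconds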

Common Pitfalls

Rank Too Low

Symptoms: Model doesn't learn the task, stuck at baseline performance.

Solution: Increase rank (r=8 → r=16 → r=32). Complex tasks need more capacity.

Rank Too High

Symptoms: Quick overfitting, poor generalization, training loss drops but validation doesn't.

Solution: Decrease rank, add dropout, use more diverse data.

Wrong Learning Rate

Symptoms: Training doesn't converge or is unstable.

Solution: LoRA needs 5-10x higher LR than full fine-tuning. Start at 1e-4.

Missing Target Modules

Symptoms: Limited adaptation, model behavior doesn't change much.

Solution: Add more target modules. Start with attention projections, add feedforward if needed.

When to Use LoRA vs Full Fine-tuning

Scenario                      | Recommendation
------------------------------|------------------------------------------
Limited GPU memory            | LoRA (or QLoRA)
Multiple tasks from same base | LoRA (adapter per task)
Maximum possible quality      | Full fine-tuning (if resources allow)
Quick experimentation         | LoRA
Production deployment         | LoRA (smaller, faster to swap)
Very simple task              | LoRA with low rank
Complex domain shift          | Full fine-tuning or LoRA with high rank

In practice, LoRA often matches full fine-tuning quality at a fraction of the cost.

Merging Adapters

After training, you can merge LoRA weights into the base model for inference efficiency:

# Merge and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")

# No adapter overhead at inference
# But loses ability to swap adapters

Use merging when:

  • Deploying a single fine-tuned version
  • Maximum inference speed needed
  • Don't need adapter flexibility

Conclusion

LoRA democratizes LLM fine-tuning by making it accessible on consumer hardware:

Core concept:

  • Weight updates have low intrinsic rank
  • Learn A×B instead of full ΔW
  • 10-100x parameter reduction

Key parameters:

  • Rank (r=16 is a good default)
  • Alpha (α=2r is common)
  • Target modules (attention + optionally feedforward)

Practical tips:

  • Use higher learning rates than full fine-tuning
  • Start simple, add capacity if underfitting
  • QLoRA enables even larger models

LoRA with default settings (r=16, α=32) on attention layers is a robust starting point for most fine-tuning tasks.
