LLM Quantization: GPTQ, AWQ, GGUF and When to Use Each

Introduction

Large Language Models are resource-intensive. A 70B-parameter model in 16-bit precision requires ~140GB of memory - far beyond consumer GPUs. Quantization solves this by reducing weights from 16-bit floats to 8-bit, 4-bit, or even lower precision, dramatically cutting memory requirements with minimal quality loss.

What is Quantization?

Quantization maps continuous floating-point values to discrete integer values:

FP16 (16-bit float) → INT8 (8-bit integer) → INT4 (4-bit integer)
| Precision | Memory/Param | 7B Model | 70B Model |
|-----------|--------------|----------|-----------|
| FP32 | 4 bytes | 28 GB | 280 GB |
| FP16/BF16 | 2 bytes | 14 GB | 140 GB |
| INT8 | 1 byte | 7 GB | 70 GB |
| INT4 | 0.5 bytes | 3.5 GB | 35 GB |
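
To make the mapping concrete, here is a minimal sketch of symmetric 8-bit quantization of a single tensor in PyTorch - a toy illustration of the idea, not what production quantizers do internally:

import torch

# Toy symmetric quantization: map float values onto 256 integer levels and back
w = torch.randn(4, 4)

scale = w.abs().max() / 127                                # one scale for the whole tensor
w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
w_dequant = w_int8.float() * scale                         # dequantized values used at compute time

print("max absolute error:", (w - w_dequant).abs().max().item())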

Quantization Methods Comparison

| Method | Speed | Quality | GPU Required | Best For |
|--------|-------|---------|--------------|----------|
| GPTQ | Fast | Good | Yes (inference) | GPU deployment |
| AWQ | Fast | Better | Yes (inference) | GPU deployment |
| GGUF/GGML | Medium | Good | No (CPU/GPU) | Local/edge |
| bitsandbytes | Fast | Good | Yes | Training/fine-tuning |
| EETQ | Fastest | Good | Yes | High throughput |

GPTQ: GPU Quantization

GPTQ is a post-training quantization method that quantizes weights layer by layer, using a small calibration dataset and approximate second-order information to minimize reconstruction error. It targets GPU inference with minimal accuracy loss.

Using Pre-quantized Models

from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading GPTQ checkpoints through transformers requires the optimum and auto-gptq packages
model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

Quantizing Your Own Model

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration dataset
calibration_data = [
    "Machine learning is a field of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing deals with text data",
    # Add more diverse examples (typically 128-512 samples)
]

# Quantization config
quantization_config = GPTQConfig(
    bits=4,
    dataset=calibration_data,
    tokenizer=tokenizer,
    group_size=128,
    desc_act=True
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("llama-2-7b-gptq-4bit")
tokenizer.save_pretrained("llama-2-7b-gptq-4bit")

With AutoGPTQ Library

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
quantized_model_dir = "llama-2-7b-4bit-gptq"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration examples (use 128-512 diverse texts in practice)
calibration_texts = [
    "Machine learning is a field of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing deals with text data",
]
examples = [
    tokenizer(text, return_tensors="pt")
    for text in calibration_texts
]

# Quantize config
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    damp_percent=0.1
)

# Load model
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config
)

# Quantize
model.quantize(examples)

# Save
model.save_quantized(quantized_model_dir)

AWQ: Activation-aware Quantization

AWQ (Activation-aware Weight Quantization) protects the small fraction of weight channels that matter most - those that multiply large activations - by scaling them before quantization, often achieving better quality than GPTQ at the same bit width.
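
To build intuition for "activation-aware", here is a toy sketch (the names and scaling choice are illustrative, not the actual AWQ algorithm): input channels that see large activations are scaled up before quantization so they keep more effective precision, and the inverse scale is folded back afterwards.

import torch

def quantize_rows_4bit(w):
    """Toy symmetric 4-bit quantization with one scale per output row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

torch.manual_seed(0)
W = torch.randn(16, 64)            # [out_features, in_features]
X = torch.randn(256, 64)           # calibration activations
X[:, :4] *= 10                     # a few "salient" input channels

s = X.abs().mean(dim=0).sqrt()     # per-channel scale from activation statistics (toy choice)
W_rtn = quantize_rows_4bit(W)      # plain round-to-nearest quantization
W_awq = quantize_rows_4bit(W * s) / s   # quantize W*diag(s), then fold diag(1/s) back in

print("round-to-nearest output error:", (X @ W.T - X @ W_rtn.T).pow(2).mean().item())
print("activation-aware output error:", (X @ W.T - X @ W_awq.T).pow(2).mean().item())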

Using Pre-quantized AWQ Models

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-AWQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Explain quantum computing:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Quantizing with AWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

AWQ with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)

GGUF: CPU-Friendly Quantization

GGUF (GPT-Generated Unified Format) is designed for efficient CPU inference with optional GPU acceleration. It's the format used by llama.cpp and Ollama.

Quantization Levels

| Quant Type | Bits | Size (7B) | Quality | Speed |
|------------|------|-----------|---------|-------|
| Q2_K | 2 | 2.5 GB | Low | Fastest |
| Q3_K_M | 3 | 3.3 GB | Fair | Fast |
| Q4_K_M | 4 | 4.0 GB | Good | Fast |
| Q5_K_M | 5 | 4.8 GB | Very Good | Medium |
| Q6_K | 6 | 5.5 GB | Excellent | Medium |
| Q8_0 | 8 | 7.0 GB | Near FP16 | Slower |
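
You rarely need the whole multi-quant repository; a single GGUF file can be fetched from the Hugging Face Hub with huggingface_hub (the repo and filename below follow TheBloke's naming and are given as an example):

from huggingface_hub import hf_hub_download

# Download just one quantization level instead of the full repository
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
)
print(model_path)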

Using with llama-cpp-python

from llama_cpp import Llama

# Load GGUF model
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload layers to GPU
    n_threads=8
)

# Generate
output = llm(
    "Explain machine learning in simple terms:",
    max_tokens=200,
    temperature=0.7,
    stop=["User:", "\n\n"]
)

print(output["choices"][0]["text"])

Converting to GGUF

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py /path/to/model --outfile model.gguf

# Quantize (requires building llama.cpp first, e.g. with cmake, to get the llama-quantize binary)
./llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M

Using with Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create my-model -f Modelfile

# Run
ollama run my-model

bitsandbytes: Training-Friendly Quantization

bitsandbytes is ideal when you need to fine-tune or train with quantized models.

8-bit Loading

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

4-bit Loading (QLoRA)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# Ready for QLoRA fine-tuning!

QLoRA Fine-tuning

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare model
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~16.8M (roughly 0.25% of the 7B base model; exact figures depend on how quantized params are counted)

Choosing the Right Method

Decision Tree

Need to fine-tune?
├── Yes → bitsandbytes (QLoRA)
└── No
    ├── GPU available?
    │   ├── Yes
    │   │   ├── Need best quality? → AWQ
    │   │   ├── Need fastest inference? → GPTQ
    │   │   └── Using vLLM? → AWQ or GPTQ
    │   └── No → GGUF (llama.cpp/Ollama)
    └── Edge/Mobile? → GGUF with aggressive quantization

Quality Comparison

Testing on common benchmarks (lower perplexity = better):

| Model | FP16 | GPTQ-4bit | AWQ-4bit | GGUF Q4_K_M |
|-------|------|-----------|----------|-------------|
| Llama-2-7B | 5.47 | 5.62 | 5.58 | 5.65 |
| Llama-2-13B | 4.88 | 5.01 | 4.97 | 5.05 |
| Mistral-7B | 5.25 | 5.38 | 5.34 | 5.41 |
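
For reference, perplexity is the exponentiated average negative log-likelihood per token over an evaluation set (the same quantity the calculate_perplexity helper later in this post computes):

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$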

Practical Tips

1. Memory Estimation

def estimate_memory(params_billions, bits=16, overhead=1.2):
    """Estimate GPU memory for a model."""
    bytes_per_param = bits / 8
    memory_gb = params_billions * bytes_per_param * overhead
    return f"{memory_gb:.1f} GB"

print(estimate_memory(7, 16))   # 16.8 GB (FP16)
print(estimate_memory(7, 4))    # 4.2 GB (4-bit)
print(estimate_memory(70, 4))   # 42.0 GB (4-bit)

2. Batch Size Optimization

# With quantization, you can often increase batch size
# FP16: batch_size=1, 14GB VRAM
# 4-bit: batch_size=4, 14GB VRAM
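
With weight-only quantization, what usually caps the batch size is the KV cache rather than the weights. A rough estimator, assuming an FP16 cache and Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128):

def estimate_kv_cache_gb(batch_size, seq_len, n_layers=32, n_kv_heads=32,
                         head_dim=128, bytes_per_value=2):
    """KV cache size: 2 (K and V) * layers * heads * head_dim * bytes, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token / 1e9

print(f"{estimate_kv_cache_gb(1, 4096):.1f} GB")   # one 4k-token sequence: ~2.1 GB
print(f"{estimate_kv_cache_gb(4, 4096):.1f} GB")   # batch of 4: ~8.6 GB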

3. Mixed Precision

# Keep some layers in higher precision
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
)

4. Layer Offloading

# GGUF: Offload some layers to GPU
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=20,  # First 20 layers on GPU
)

Benchmarking Your Setup

import time
import torch

def benchmark_model(model, tokenizer, prompt, num_runs=10):
    """Benchmark inference speed."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warmup
    for _ in range(3):
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=50)

    # Benchmark
    times = []
    for _ in range(num_runs):
        torch.cuda.synchronize()
        start = time.perf_counter()

        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=100)

        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)

    tokens = len(outputs[0]) - len(inputs.input_ids[0])

    return {
        "avg_time": sum(times) / len(times),
        "tokens_per_sec": tokens / (sum(times) / len(times)),
        "memory_gb": torch.cuda.max_memory_allocated() / 1e9
    }

Complete Benchmarking Suite

Here's a comprehensive benchmarking script to evaluate quantized models:

import torch
import time
import gc
from dataclasses import dataclass
from typing import List, Dict, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer
import json

@dataclass
class BenchmarkResult:
    """Results from a single benchmark run."""
    model_name: str
    quantization: str
    prompt_tokens: int
    generated_tokens: int
    time_seconds: float
    memory_gb: float
    tokens_per_second: float

class QuantizationBenchmark:
    """Benchmark quantized models for speed and quality."""

    def __init__(self, prompts: List[str] = None):
        self.prompts = prompts or [
            "Write a Python function to calculate the Fibonacci sequence.",
            "Explain the theory of relativity in simple terms.",
            "What are the key differences between SQL and NoSQL databases?",
        ]
        self.results: List[BenchmarkResult] = []

    def benchmark_model(
        self,
        model_name: str,
        quantization: Optional[str] = None,
        max_new_tokens: int = 100,
        n_runs: int = 3
    ) -> List[BenchmarkResult]:
        """Benchmark a model with specified quantization."""

        print(f"\nBenchmarking: {model_name} ({quantization or 'fp16'})")

        # Load model based on quantization type
        load_kwargs = {"device_map": "auto", "trust_remote_code": True}

        if quantization == "gptq":
            load_kwargs["torch_dtype"] = torch.float16
        elif quantization == "awq":
            load_kwargs["torch_dtype"] = torch.float16
        elif quantization == "4bit":
            from transformers import BitsAndBytesConfig
            load_kwargs["quantization_config"] = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.bfloat16
            )
        elif quantization == "8bit":
            from transformers import BitsAndBytesConfig
            load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
        else:
            load_kwargs["torch_dtype"] = torch.float16

        # Load model and tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)

        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        run_results = []

        for prompt in self.prompts:
            for run in range(n_runs):
                # Clear cache
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                    torch.cuda.reset_peak_memory_stats()

                # Tokenize
                inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
                prompt_tokens = inputs.input_ids.shape[1]

                # Generate with timing
                torch.cuda.synchronize() if torch.cuda.is_available() else None
                start_time = time.perf_counter()

                with torch.no_grad():
                    outputs = model.generate(
                        **inputs,
                        max_new_tokens=max_new_tokens,
                        do_sample=False,
                        pad_token_id=tokenizer.pad_token_id
                    )

                torch.cuda.synchronize() if torch.cuda.is_available() else None
                elapsed = time.perf_counter() - start_time

                generated_tokens = outputs.shape[1] - prompt_tokens

                # Get memory usage
                if torch.cuda.is_available():
                    memory_gb = torch.cuda.max_memory_allocated() / 1e9
                else:
                    memory_gb = 0

                result = BenchmarkResult(
                    model_name=model_name,
                    quantization=quantization or "fp16",
                    prompt_tokens=prompt_tokens,
                    generated_tokens=generated_tokens,
                    time_seconds=elapsed,
                    memory_gb=memory_gb,
                    tokens_per_second=generated_tokens / elapsed
                )
                run_results.append(result)

                print(f"  Run {run+1}/{n_runs}: {result.tokens_per_second:.1f} tok/s, {result.memory_gb:.2f} GB")

        # Cleanup
        del model
        del tokenizer
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        self.results.extend(run_results)
        return run_results

    def get_summary(self) -> List[Dict]:
        """Get aggregated summary of all benchmark runs."""
        from collections import defaultdict
        from statistics import mean, stdev

        grouped = defaultdict(list)
        for r in self.results:
            key = (r.model_name, r.quantization)
            grouped[key].append(r)

        summary = []
        for (model, quant), runs in grouped.items():
            tps = [r.tokens_per_second for r in runs]
            mem = [r.memory_gb for r in runs]

            summary.append({
                "model": model,
                "quantization": quant,
                "avg_tokens_per_sec": mean(tps),
                "std_tokens_per_sec": stdev(tps) if len(tps) > 1 else 0,
                "avg_memory_gb": mean(mem),
                "n_runs": len(runs)
            })

        return summary

    def save_results(self, filepath: str):
        """Save results to JSON."""
        data = {
            "results": [vars(r) for r in self.results],
            "summary": self.get_summary()
        }
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)


# Example usage
if __name__ == "__main__":
    benchmark = QuantizationBenchmark()

    # Compare different quantizations of the same model
    models_to_test = [
        ("TheBloke/Llama-2-7B-fp16", None),
        ("TheBloke/Llama-2-7B-GPTQ", "gptq"),
        ("TheBloke/Llama-2-7B-AWQ", "awq"),
        # For bitsandbytes, use the base model
        ("meta-llama/Llama-2-7b-hf", "4bit"),
        ("meta-llama/Llama-2-7b-hf", "8bit"),
    ]

    for model_name, quant in models_to_test:
        try:
            benchmark.benchmark_model(model_name, quant, n_runs=3)
        except Exception as e:
            print(f"Failed to benchmark {model_name}: {e}")

    # Print summary
    print("\n=== Summary ===")
    for s in benchmark.get_summary():
        print(f"{s['model']} ({s['quantization']}): "
              f"{s['avg_tokens_per_sec']:.1f} tok/s, "
              f"{s['avg_memory_gb']:.2f} GB")

    benchmark.save_results("benchmark_results.json")

Quality Evaluation

Quantization reduces model quality. Here's how to measure the impact:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import math

def calculate_perplexity(model, tokenizer, texts: list, max_length: int = 512) -> float:
    """Calculate perplexity on a set of texts."""
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(
                text,
                return_tensors="pt",
                truncation=True,
                max_length=max_length
            ).to(model.device)

            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss.item()

            total_loss += loss * inputs["input_ids"].shape[1]
            total_tokens += inputs["input_ids"].shape[1]

    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)

    return perplexity

# Test texts (use a proper evaluation dataset in practice)
eval_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming how we interact with technology.",
    "In the beginning, there was nothing but vast emptiness.",
]

# Compare perplexity across quantizations
# Lower perplexity = better
results = {}

# FP16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
results["fp16"] = calculate_perplexity(model_fp16, tokenizer, eval_texts)
del model_fp16

# GPTQ
model_gptq = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)
results["gptq"] = calculate_perplexity(model_gptq, tokenizer, eval_texts)
del model_gptq

print("Perplexity comparison:")
for quant, ppl in results.items():
    print(f"  {quant}: {ppl:.2f}")

Troubleshooting Common Issues

CUDA Out of Memory

# Solution 1: Generate fewer tokens per request
model.generate(**inputs, max_new_tokens=50)  # Instead of 512

# Solution 2: Use more aggressive quantization
import torch
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True  # Nested quantization for extra savings
)

# Solution 3: Offload to CPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True
)

GPTQ Loading Errors

# Install correct packages
pip install auto-gptq optimum

# For CUDA 11.8
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

AWQ Compatibility

# AWQ requires specific GPU architectures (Ampere+)
# For older GPUs, use GPTQ instead

# Check GPU compatibility
import torch
if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    print(f"CUDA capability: {capability}")
    if capability[0] < 8:
        print("Warning: AWQ may not work optimally on this GPU")

Decision Guide

Need to fine-tune?
├── Yes → bitsandbytes 4-bit (QLoRA)
└── No
    ├── Have GPU?
    │   ├── Yes
    │   │   ├── Ampere+ GPU (RTX 30xx, 40xx)?
    │   │   │   ├── Yes → AWQ (best quality + speed)
    │   │   │   └── No → GPTQ (wider compatibility)
    │   │   └── Need multiple models?
    │   │       └── Use vLLM with quantization
    │   └── No → GGUF (llama.cpp/Ollama)
    └── Edge deployment? → GGUF with aggressive quantization

Conclusion

Quantization makes powerful LLMs accessible on consumer hardware:

  • GPTQ: Best for GPU inference with good speed and wide compatibility
  • AWQ: Better quality than GPTQ on modern GPUs (Ampere+)
  • GGUF: Best for CPU or mixed CPU/GPU inference
  • bitsandbytes: Essential for fine-tuning (QLoRA)

Key recommendations:

  1. Start with pre-quantized models from TheBloke on Hugging Face
  2. Use Q4_K_M for GGUF - best balance of size and quality
  3. Benchmark on your hardware - results vary significantly
  4. Evaluate quality - measure perplexity on your use case
  5. Consider AWQ for production GPU deployments
