LLM Quantization: GPTQ, AWQ, GGUF and When to Use Each

Introduction

Large Language Models are resource-intensive. A 70B-parameter model in 16-bit precision requires ~140GB of memory - far beyond consumer GPUs. Quantization solves this by reducing weights from 16-bit floats to 8-bit, 4-bit, or even lower precision, dramatically cutting memory requirements with minimal quality loss.

What is Quantization?

Quantization maps continuous floating-point values to discrete integer values:

FP16 (16-bit float) → INT8 (8-bit integer) → INT4 (4-bit integer)
| Precision | Memory/Param | 7B Model | 70B Model |
|-----------|--------------|----------|-----------|
| FP32 | 4 bytes | 28 GB | 280 GB |
| FP16/BF16 | 2 bytes | 14 GB | 140 GB |
| INT8 | 1 byte | 7 GB | 70 GB |
| INT4 | 0.5 bytes | 3.5 GB | 35 GB |
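
To make the mapping concrete, here is a minimal sketch of symmetric 8-bit quantization of a single tensor in PyTorch - a toy illustration of the idea, not what production quantizers do internally:

import torch

# Toy symmetric quantization: map float values onto 256 integer levels and back
w = torch.randn(4, 4)

scale = w.abs().max() / 127                                # one scale for the whole tensor
w_int8 = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
w_dequant = w_int8.float() * scale                         # dequantized values used at compute time

print("max absolute error:", (w - w_dequant).abs().max().item())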

Quantization Methods Comparison

| Method | Speed | Quality | GPU Required | Best For |
|--------|-------|---------|--------------|----------|
| GPTQ | Fast | Good | Yes (inference) | GPU deployment |
| AWQ | Fast | Better | Yes (inference) | GPU deployment |
| GGUF/GGML | Medium | Good | No (CPU/GPU) | Local/edge |
| bitsandbytes | Fast | Good | Yes | Training/fine-tuning |
| EETQ | Fastest | Good | Yes | High throughput |

GPTQ: GPU Quantization

GPTQ is a post-training quantization method that quantizes weights layer by layer, using a small calibration dataset and approximate second-order information to minimize reconstruction error. It targets GPU inference with minimal accuracy loss.

Using Pre-quantized Models

from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading GPTQ checkpoints through transformers requires the optimum and auto-gptq packages
model_id = "TheBloke/Llama-2-7B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

Quantizing Your Own Model

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-7b-hf"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration dataset
calibration_data = [
    "Machine learning is a field of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing deals with text data",
    # Add more diverse examples (typically 128-512 samples)
]

# Quantization config
quantization_config = GPTQConfig(
    bits=4,
    dataset=calibration_data,
    tokenizer=tokenizer,
    group_size=128,
    desc_act=True
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto"
)

# Save quantized model
model.save_pretrained("llama-2-7b-gptq-4bit")
tokenizer.save_pretrained("llama-2-7b-gptq-4bit")

With AutoGPTQ Library

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
quantized_model_dir = "llama-2-7b-4bit-gptq"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Prepare calibration examples (use 128-512 diverse texts in practice)
calibration_texts = [
    "Machine learning is a field of artificial intelligence",
    "Deep learning uses neural networks with many layers",
    "Natural language processing deals with text data",
]
examples = [
    tokenizer(text, return_tensors="pt")
    for text in calibration_texts
]

# Quantize config
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    damp_percent=0.1
)

# Load model
model = AutoGPTQForCausalLM.from_pretrained(
    model_id,
    quantize_config=quantize_config
)

# Quantize
model.quantize(examples)

# Save
model.save_quantized(quantized_model_dir)

AWQ: Activation-aware Quantization

AWQ (Activation-aware Weight Quantization) protects the small fraction of weight channels that matter most - those that multiply large activations - by scaling them before quantization, often achieving better quality than GPTQ at the same bit width.
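
To build intuition for "activation-aware", here is a toy sketch (the names and scaling choice are illustrative, not the actual AWQ algorithm): input channels that see large activations are scaled up before quantization so they keep more effective precision, and the inverse scale is folded back afterwards.

import torch

def quantize_rows_4bit(w):
    """Toy symmetric 4-bit quantization with one scale per output row."""
    scale = w.abs().amax(dim=1, keepdim=True) / 7
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

torch.manual_seed(0)
W = torch.randn(16, 64)            # [out_features, in_features]
X = torch.randn(256, 64)           # calibration activations
X[:, :4] *= 10                     # a few "salient" input channels

s = X.abs().mean(dim=0).sqrt()     # per-channel scale from activation statistics (toy choice)
W_rtn = quantize_rows_4bit(W)      # plain round-to-nearest quantization
W_awq = quantize_rows_4bit(W * s) / s   # quantize W*diag(s), then fold diag(1/s) back in

print("round-to-nearest output error:", (X @ W.T - X @ W_rtn.T).pow(2).mean().item())
print("activation-aware output error:", (X @ W.T - X @ W_awq.T).pow(2).mean().item())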

Using Pre-quantized AWQ Models

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-AWQ"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Explain quantum computing:", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Quantizing with AWQ

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq"

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantization config
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM"
}

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

AWQ with vLLM

from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Llama-2-13B-AWQ",
    quantization="awq",
    dtype="half"
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)

GGUF: CPU-Friendly Quantization

GGUF (GPT-Generated Unified Format) is designed for efficient CPU inference with optional GPU acceleration. It's the format used by llama.cpp and Ollama.

Quantization Levels

| Quant Type | Bits | Size (7B) | Quality | Speed |
|------------|------|-----------|---------|-------|
| Q2_K | 2 | 2.5 GB | Low | Fastest |
| Q3_K_M | 3 | 3.3 GB | Fair | Fast |
| Q4_K_M | 4 | 4.0 GB | Good | Fast |
| Q5_K_M | 5 | 4.8 GB | Very Good | Medium |
| Q6_K | 6 | 5.5 GB | Excellent | Medium |
| Q8_0 | 8 | 7.0 GB | Near FP16 | Slower |
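
You rarely need the whole multi-quant repository; a single GGUF file can be fetched from the Hugging Face Hub with huggingface_hub (the repo and filename below follow TheBloke's naming and are given as an example):

from huggingface_hub import hf_hub_download

# Download just one quantization level instead of the full repository
model_path = hf_hub_download(
    repo_id="TheBloke/Llama-2-7B-GGUF",
    filename="llama-2-7b.Q4_K_M.gguf",
)
print(model_path)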

Using with llama-cpp-python

from llama_cpp import Llama

# Load GGUF model
llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # Offload layers to GPU
    n_threads=8
)

# Generate
output = llm(
    "Explain machine learning in simple terms:",
    max_tokens=200,
    temperature=0.7,
    stop=["User:", "\n\n"]
)

print(output["choices"][0]["text"])

Converting to GGUF

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF
python convert_hf_to_gguf.py /path/to/model --outfile model.gguf

# Quantize (requires building llama.cpp first, e.g. with cmake, to get the llama-quantize binary)
./llama-quantize model.gguf model-q4_k_m.gguf Q4_K_M

Using with Ollama

# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create my-model -f Modelfile

# Run
ollama run my-model

bitsandbytes: Training-Friendly Quantization

bitsandbytes is ideal when you need to fine-tune or train with quantized models.

8-bit Loading

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

4-bit Loading (QLoRA)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # or "fp4"
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# Ready for QLoRA fine-tuning!

QLoRA Fine-tuning

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare model
model = prepare_model_for_kbit_training(model)

# LoRA config
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: ~16.8M (roughly 0.25% of the 7B base model; exact figures depend on how quantized params are counted)

Choosing the Right Method

Decision Tree

Need to fine-tune?
├── Yes → bitsandbytes (QLoRA)
└── No
    ├── GPU available?
    │   ├── Yes
    │   │   ├── Need best quality? → AWQ
    │   │   ├── Need fastest inference? → GPTQ
    │   │   └── Using vLLM? → AWQ or GPTQ
    │   └── No → GGUF (llama.cpp/Ollama)
    └── Edge/Mobile? → GGUF with aggressive quantization

Quality Comparison

Testing on common benchmarks (lower perplexity = better):

| Model | FP16 | GPTQ-4bit | AWQ-4bit | GGUF Q4_K_M |
|-------|------|-----------|----------|-------------|
| Llama-2-7B | 5.47 | 5.62 | 5.58 | 5.65 |
| Llama-2-13B | 4.88 | 5.01 | 4.97 | 5.05 |
| Mistral-7B | 5.25 | 5.38 | 5.34 | 5.41 |
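
For reference, perplexity is the exponentiated average negative log-likelihood per token over an evaluation set (the same quantity the calculate_perplexity helper later in this post computes):

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$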

Practical Tips

1. Memory Estimation

def estimate_memory(params_billions, bits=16, overhead=1.2):
    """Estimate GPU memory for a model."""
    bytes_per_param = bits / 8
    memory_gb = params_billions * bytes_per_param * overhead
    return f"{memory_gb:.1f} GB"

print(estimate_memory(7, 16))   # 16.8 GB (FP16)
print(estimate_memory(7, 4))    # 4.2 GB (4-bit)
print(estimate_memory(70, 4))   # 42.0 GB (4-bit)

2. Batch Size Optimization

# With quantization, you can often increase batch size
# FP16: batch_size=1, 14GB VRAM
# 4-bit: batch_size=4, 14GB VRAM
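
With weight-only quantization, what usually caps the batch size is the KV cache rather than the weights. A rough estimator, assuming an FP16 cache and Llama-2-7B-like dimensions (32 layers, 32 KV heads, head dim 128):

def estimate_kv_cache_gb(batch_size, seq_len, n_layers=32, n_kv_heads=32,
                         head_dim=128, bytes_per_value=2):
    """KV cache size: 2 (K and V) * layers * heads * head_dim * bytes, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return batch_size * seq_len * per_token / 1e9

print(f"{estimate_kv_cache_gb(1, 4096):.1f} GB")   # one 4k-token sequence: ~2.1 GB
print(f"{estimate_kv_cache_gb(4, 4096):.1f} GB")   # batch of 4: ~8.6 GB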

3. Mixed Precision

# Keep some layers in higher precision
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16
)

4. Layer Offloading

# GGUF: Offload some layers to GPU
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=20,  # First 20 layers on GPU
)

Benchmarking Your Setup

import time
import torch

def benchmark_model(model, tokenizer, prompt, num_runs=10):
    """Benchmark inference speed."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Warmup
    for _ in range(3):
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=50)

    # Benchmark
    times = []
    for _ in range(num_runs):
        torch.cuda.synchronize()
        start = time.perf_counter()

        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=100)

        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)

    tokens = len(outputs[0]) - len(inputs.input_ids[0])

    return {
        "avg_time": sum(times) / len(times),
        "tokens_per_sec": tokens / (sum(times) / len(times)),
        "memory_gb": torch.cuda.max_memory_allocated() / 1e9
    }

Complete Benchmarking Suite

Here's a comprehensive benchmarking script to evaluate quantized models:

import torch
import time
import gc
from dataclasses import dataclass
from typing import List, Dict, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer
import json

@dataclass
class BenchmarkResult:
    """Results from a single benchmark run."""
    model_name: str
    quantization: str
    prompt_tokens: int
    generated_tokens: int
    time_seconds: float
    memory_gb: float
    tokens_per_second: float

class QuantizationBenchmark:
    """Benchmark quantized models for speed and quality."""

    def __init__(self, prompts: List[str] = None):
        self.prompts = prompts or [
            "Write a Python function to calculate the Fibonacci sequence.",
            "Explain the theory of relativity in simple terms.",
            "What are the key differences between SQL and NoSQL databases?",
        ]
        self.results: List[BenchmarkResult] = []

    def benchmark_model(
        self,
        model_name: str,
        quantization: Optional[str] = None,
        max_new_tokens: int = 100,
        n_runs: int = 3
    ) -> List[BenchmarkResult]:
        """Benchmark a model with specified quantization."""

        print(f"\nBenchmarking: {model_name} ({quantization or 'fp16'})")

        # Load model based on quantization type
        load_kwargs = {"device_map": "auto", "trust_remote_code": True}

        if quantization == "gptq":
            load_kwargs["torch_dtype"] = torch.float16
        elif quantization == "awq":
            load_kwargs["torch_dtype"] = torch.float16
        elif quantization == "4bit":
            from transformers import BitsAndBytesConfig
            load_kwargs["quantization_config"] = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.bfloat16
            )
        elif quantization == "8bit":
            from transformers import BitsAndBytesConfig
            load_kwargs["quantization_config"] = BitsAndBytesConfig(load_in_8bit=True)
        else:
            load_kwargs["torch_dtype"] = torch.float16

        # Load model and tokenizer
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)

        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        run_results = []

        for prompt in self.prompts:
            for run in range(n_runs):
                # Clear cache
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
                    torch.cuda.reset_peak_memory_stats()

                # Tokenize
                inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
                prompt_tokens = inputs.input_ids.shape[1]

                # Generate with timing
                torch.cuda.synchronize() if torch.cuda.is_available() else None
                start_time = time.perf_counter()

                with torch.no_grad():
                    outputs = model.generate(
                        **inputs,
                        max_new_tokens=max_new_tokens,
                        do_sample=False,
                        pad_token_id=tokenizer.pad_token_id
                    )

                torch.cuda.synchronize() if torch.cuda.is_available() else None
                elapsed = time.perf_counter() - start_time

                generated_tokens = outputs.shape[1] - prompt_tokens

                # Get memory usage
                if torch.cuda.is_available():
                    memory_gb = torch.cuda.max_memory_allocated() / 1e9
                else:
                    memory_gb = 0

                result = BenchmarkResult(
                    model_name=model_name,
                    quantization=quantization or "fp16",
                    prompt_tokens=prompt_tokens,
                    generated_tokens=generated_tokens,
                    time_seconds=elapsed,
                    memory_gb=memory_gb,
                    tokens_per_second=generated_tokens / elapsed
                )
                run_results.append(result)

                print(f"  Run {run+1}/{n_runs}: {result.tokens_per_second:.1f} tok/s, {result.memory_gb:.2f} GB")

        # Cleanup
        del model
        del tokenizer
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        self.results.extend(run_results)
        return run_results

    def get_summary(self) -> List[Dict]:
        """Get aggregated summary of all benchmark runs."""
        from collections import defaultdict
        from statistics import mean, stdev

        grouped = defaultdict(list)
        for r in self.results:
            key = (r.model_name, r.quantization)
            grouped[key].append(r)

        summary = []
        for (model, quant), runs in grouped.items():
            tps = [r.tokens_per_second for r in runs]
            mem = [r.memory_gb for r in runs]

            summary.append({
                "model": model,
                "quantization": quant,
                "avg_tokens_per_sec": mean(tps),
                "std_tokens_per_sec": stdev(tps) if len(tps) > 1 else 0,
                "avg_memory_gb": mean(mem),
                "n_runs": len(runs)
            })

        return summary

    def save_results(self, filepath: str):
        """Save results to JSON."""
        data = {
            "results": [vars(r) for r in self.results],
            "summary": self.get_summary()
        }
        with open(filepath, 'w') as f:
            json.dump(data, f, indent=2)


# Example usage
if __name__ == "__main__":
    benchmark = QuantizationBenchmark()

    # Compare different quantizations of the same model
    models_to_test = [
        ("TheBloke/Llama-2-7B-fp16", None),
        ("TheBloke/Llama-2-7B-GPTQ", "gptq"),
        ("TheBloke/Llama-2-7B-AWQ", "awq"),
        # For bitsandbytes, use the base model
        ("meta-llama/Llama-2-7b-hf", "4bit"),
        ("meta-llama/Llama-2-7b-hf", "8bit"),
    ]

    for model_name, quant in models_to_test:
        try:
            benchmark.benchmark_model(model_name, quant, n_runs=3)
        except Exception as e:
            print(f"Failed to benchmark {model_name}: {e}")

    # Print summary
    print("\n=== Summary ===")
    for s in benchmark.get_summary():
        print(f"{s['model']} ({s['quantization']}): "
              f"{s['avg_tokens_per_sec']:.1f} tok/s, "
              f"{s['avg_memory_gb']:.2f} GB")

    benchmark.save_results("benchmark_results.json")

Quality Evaluation

Quantization reduces model quality. Here's how to measure the impact:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import math

def calculate_perplexity(model, tokenizer, texts: list, max_length: int = 512) -> float:
    """Calculate perplexity on a set of texts."""
    model.eval()
    total_loss = 0
    total_tokens = 0

    with torch.no_grad():
        for text in texts:
            inputs = tokenizer(
                text,
                return_tensors="pt",
                truncation=True,
                max_length=max_length
            ).to(model.device)

            outputs = model(**inputs, labels=inputs["input_ids"])
            loss = outputs.loss.item()

            total_loss += loss * inputs["input_ids"].shape[1]
            total_tokens += inputs["input_ids"].shape[1]

    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)

    return perplexity

# Test texts (use a proper evaluation dataset in practice)
eval_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Machine learning is transforming how we interact with technology.",
    "In the beginning, there was nothing but vast emptiness.",
]

# Compare perplexity across quantizations
# Lower perplexity = better
results = {}

# FP16 baseline
model_fp16 = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
results["fp16"] = calculate_perplexity(model_fp16, tokenizer, eval_texts)
del model_fp16

# GPTQ
model_gptq = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto"
)
results["gptq"] = calculate_perplexity(model_gptq, tokenizer, eval_texts)
del model_gptq

print("Perplexity comparison:")
for quant, ppl in results.items():
    print(f"  {quant}: {ppl:.2f}")

Troubleshooting Common Issues

CUDA Out of Memory

# Solution 1: Generate fewer tokens per request
model.generate(**inputs, max_new_tokens=50)  # Instead of 512

# Solution 2: Use more aggressive quantization
import torch
from transformers import BitsAndBytesConfig

config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True  # Nested quantization for extra savings
)

# Solution 3: Offload to CPU
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True
)

GPTQ Loading Errors

# Install correct packages
pip install auto-gptq optimum

# For CUDA 11.8
pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/

AWQ Compatibility

# AWQ requires specific GPU architectures (Ampere+)
# For older GPUs, use GPTQ instead

# Check GPU compatibility
import torch
if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    print(f"CUDA capability: {capability}")
    if capability[0] < 8:
        print("Warning: AWQ may not work optimally on this GPU")

Decision Guide

Need to fine-tune?
├── Yes → bitsandbytes 4-bit (QLoRA)
└── No
    ├── Have GPU?
    │   ├── Yes
    │   │   ├── Ampere+ GPU (RTX 30xx, 40xx)?
    │   │   │   ├── Yes → AWQ (best quality + speed)
    │   │   │   └── No → GPTQ (wider compatibility)
    │   │   └── Need multiple models?
    │   │       └── Use vLLM with quantization
    │   └── No → GGUF (llama.cpp/Ollama)
    └── Edge deployment? → GGUF with aggressive quantization

Conclusion

Quantization makes powerful LLMs accessible on consumer hardware:

  • GPTQ: Best for GPU inference with good speed and wide compatibility
  • AWQ: Better quality than GPTQ on modern GPUs (Ampere+)
  • GGUF: Best for CPU or mixed CPU/GPU inference
  • bitsandbytes: Essential for fine-tuning (QLoRA)

Key recommendations:

  1. Start with pre-quantized models from TheBloke on Hugging Face
  2. Use Q4_K_M for GGUF - best balance of size and quality
  3. Benchmark on your hardware - results vary significantly
  4. Evaluate quality - measure perplexity on your use case
  5. Consider AWQ for production GPU deployments
