Quantization for Small Models: A Practical, Reproducible Guide

This article outlines a clear, reproducible workflow for applying quantization to small language models. The objective is to reduce memory usage, improve inference efficiency, and retain acceptable accuracy on constrained hardware.

Purpose

Quantization converts model weights from floating-point formats (fp32 or fp16) into lower-precision representations such as int8 or int4. This reduces VRAM and RAM consumption and enables running larger models on limited devices without modifying model architecture.
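
To make the conversion concrete, here is a minimal sketch of symmetric "absmax" int8 quantization in plain Python. It is an illustration only: the function names and sample weights are made up for this example, and production libraries use optimized kernels and more elaborate schemes.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole list; guard against an all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.50, 0.33, 0.01, -0.27]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("integer codes:", codes)
print(f"scale={scale:.6f}, max round-trip error={max_error:.6f}")
```

Each weight is stored as a small integer plus one shared scale, and the round-trip error per weight is bounded by half the scale.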

Scientific Basis

Quantization reduces the numeric precision of weights while preserving structural relationships.
4-bit methods apply additional techniques to minimize accuracy loss: grouped (block-wise) quantization stores one scale per small block of weights, and double quantization also quantizes those scales.
Inference is feasible because many transformer components are resilient to reduced precision.
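
The effect of grouped quantization can be sketched with a toy example. The code below compares one scale for a whole tensor against one scale per group of four weights, using uniform signed 4-bit codes. This is a simplification under stated assumptions: bitsandbytes uses its own 4-bit data types rather than this uniform grid, double quantization is not implemented here, and all names and values are hypothetical.

```python
def quantize_grouped(weights, group_size=4, bits=4):
    """Symmetric absmax quantization with one scale per group of weights."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in group)
    return codes, scales


def dequantize_grouped(codes, scales, group_size=4):
    """Multiply each code by the scale of the group it belongs to."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Eight hypothetical weights; the last one is an outlier.
weights = [0.02, -0.03, 0.01, 0.04, -0.02, 0.03, 0.01, 2.00]

# One scale for the whole tensor versus one scale per group of four.
codes_t, scales_t = quantize_grouped(weights, group_size=len(weights))
codes_g, scales_g = quantize_grouped(weights, group_size=4)

err_tensor = mean_abs_error(
    weights, dequantize_grouped(codes_t, scales_t, len(weights)))
err_grouped = mean_abs_error(
    weights, dequantize_grouped(codes_g, scales_g, 4))

print(f"per-tensor mean abs error: {err_tensor:.5f}")
print(f"grouped mean abs error:    {err_grouped:.5f}")
```

With a single scale, the outlier forces every small weight to round to zero. With per-group scales, only the group containing the outlier is affected, so the average error drops.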

When to Use Quantization

Scenario	Suitability
Running models on 4–8 GB GPUs	Highly suitable
CPU-only inference	Useful for memory reduction
Training	Not recommended with this workflow (it targets inference only)
RAG or lightweight assistants	Appropriate

Environment Setup

Create an isolated environment for quantized inference:

cat <<'Eof' > setup_quant.sh
#!/usr/bin/env bash
set -euo pipefail

VENV_DIR=quant-venv
python3 -m venv "${VENV_DIR}"
source "${VENV_DIR}/bin/activate"

pip install --upgrade pip setuptools wheel
pip install transformers accelerate bitsandbytes safetensors

echo "Environment ready."
echo "Activate with: source ${VENV_DIR}/bin/activate"
Eof

Run the setup:

chmod +x setup_quant.sh
./setup_quant.sh

Quantized Loading Script

The script below loads a small LLM with 4-bit quantization using bitsandbytes. Adjust the model identifier based on available hardware.

cat <<'Eof' > run_quantized.py
#!/usr/bin/env python3

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# ===== USER SETTINGS =====
model_id = "your-small-llm"  # replace with a 2B–7B model supporting 4-bit load
prompt = "Summarize the concept of quantization in one short paragraph."
max_new_tokens = 120
# =========================

# Quantization configuration
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inference
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Unpack all tokenizer outputs so the attention mask is passed along with the input IDs
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)

print(tokenizer.decode(out[0], skip_special_tokens=True))
Eof

Run the Quantized Model

source quant-venv/bin/activate
python run_quantized.py

Expected Benefits

Memory reduction: 4-bit weights cut weight memory by ~75% compared to fp16. Activations and the KV cache are not reduced, and the stored scales add a small overhead.
Feasible inference: models that normally require 12–16 GB VRAM may fit into 6–8 GB.
No architecture changes: quantization occurs at load time.
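
The arithmetic behind these figures can be checked with a short back-of-the-envelope sketch. It counts weight storage only, so activations, the KV cache, and quantization scales are excluded. The 7-billion parameter count is an assumed example, not a measurement of any particular model.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate weight storage in GiB (weights only, no runtime overhead)."""
    return num_params * bits_per_weight / 8 / 1024 ** 3


num_params = 7_000_000_000  # hypothetical 7B-parameter model

for label, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gib(num_params, bits):6.2f} GiB")

# Relative saving of 4-bit weights compared to fp16 weights.
reduction = 1 - weight_memory_gib(num_params, 4) / weight_memory_gib(num_params, 16)
print(f"4-bit vs fp16 reduction: {reduction:.0%}")
```

Actual VRAM use is higher than the weight-only estimate, which is why a model with roughly 3 to 4 GiB of 4-bit weights is better planned for a 6–8 GB card than a 4 GB one.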

Evaluation Considerations

Quantization introduces small numerical deviations. Evaluate:

Answer fidelity for domain tasks
Latency and throughput
Hallucination behaviour compared to fp16 baseline
Stability for long outputs
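
A minimal sketch of such an evaluation harness is shown below. It times any text-generation callable and computes a crude token-overlap score between a baseline answer and a quantized answer. The helper names are hypothetical, the two stand-in generators return fixed strings in place of real model calls, and token overlap is only a rough proxy for answer fidelity, not a substitute for task-specific evaluation.

```python
import time


def time_generation(generate_fn, prompt, runs=3):
    """Return (last_output, mean_seconds) for a text-generation callable."""
    durations = []
    output = None
    for _ in range(runs):
        start = time.perf_counter()
        output = generate_fn(prompt)
        durations.append(time.perf_counter() - start)
    return output, sum(durations) / len(durations)


def token_overlap(reference, candidate):
    """Jaccard overlap of whitespace-separated tokens (crude fidelity proxy)."""
    ref = set(reference.lower().split())
    cand = set(candidate.lower().split())
    if not ref and not cand:
        return 1.0
    return len(ref & cand) / len(ref | cand)


# Stand-ins for real model calls; replace with functions that wrap the
# fp16 baseline and the quantized model.
def fake_fp16(prompt):
    return "Quantization stores weights in fewer bits to save memory."


def fake_4bit(prompt):
    return "Quantization stores weights in fewer bits, saving memory."


prompt = "Summarize the concept of quantization in one short paragraph."
baseline, baseline_s = time_generation(fake_fp16, prompt)
quantized, quantized_s = time_generation(fake_4bit, prompt)
overlap = token_overlap(baseline, quantized)

print(f"baseline: {baseline_s:.6f}s, quantized: {quantized_s:.6f}s")
print(f"token overlap: {overlap:.2f}")
```

Running the same prompts through both models and logging these numbers over a fixed prompt set gives a repeatable comparison against the fp16 baseline.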

Common Issues

Model not loading: some models do not ship with quantization-compatible layers.
OOM errors: reduce sequence length or use smaller models.
Slow inference: if the model falls back to CPU, the GPU-specific packages (for example a CUDA-enabled PyTorch or bitsandbytes build) were likely not installed correctly.
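
A quick diagnostic can narrow down several of these issues before loading a model. The sketch below checks whether the packages from the setup script are importable and whether PyTorch can see a CUDA device. The helper names and the package list are assumptions made for this example, and a passing check does not guarantee that a given model will load.

```python
import importlib.util

# Packages installed by the setup script, plus torch (pulled in as a dependency).
REQUIRED = ["torch", "transformers", "accelerate", "bitsandbytes", "safetensors"]


def check_packages(names):
    """Map each package name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def cuda_available():
    """Return True or False if torch imports cleanly, None if it does not."""
    if importlib.util.find_spec("torch") is None:
        return None
    try:
        import torch
        return torch.cuda.is_available()
    except Exception:
        # A broken torch install counts as "unknown" rather than crashing.
        return None


status = check_packages(REQUIRED)
for name, found in status.items():
    print(f"{name:>14}: {'ok' if found else 'MISSING'}")

gpu = cuda_available()
if gpu is None:
    print("CUDA check skipped: torch could not be imported.")
elif gpu:
    print("CUDA device detected.")
else:
    print("No CUDA device detected; inference would run on CPU.")
```

Run it inside the activated virtual environment so that it inspects the same packages the loading script uses.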

Summary

Quantization is a practical method to run small LLMs on constrained hardware. It reduces memory requirements, maintains acceptable accuracy, and integrates cleanly with the transformers ecosystem. This workflow provides a reproducible baseline for deploying quantized models in local environments.
