Quantization for Small Models: A Practical, Reproducible Guide

This article outlines a clear, reproducible workflow for applying quantization to small language models. The objective is to reduce memory usage, improve inference efficiency, and retain acceptable accuracy on constrained hardware.

Purpose

Quantization converts model weights from floating-point formats (fp32 or fp16) into lower-precision representations such as int8 or int4. This reduces VRAM and RAM consumption and enables running larger models on limited devices without modifying model architecture.

Scientific Basis

  • Quantization reduces the numeric precision of weights while approximately preserving their relative magnitudes within each tensor.
  • 4-bit methods apply additional techniques (double quantization, grouped quantization) to minimize accuracy loss.
  • Inference is feasible because many transformer components are resilient to reduced precision.
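To make the idea of grouped quantization concrete, here is a minimal pure-Python sketch. It assumes a simple symmetric (absmax) scheme with one scale per group of weights; the function names, the group size, and the sample weights are illustrative only. This is not how bitsandbytes implements its 4-bit data types, just a toy version of the same principle.

```python
"""Toy symmetric (absmax) quantization with one scale per group of weights."""


def quantize_group(values, bits=4):
    """Map floats to signed integer codes in [-qmax, qmax] using one scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round to the nearest code and clamp to guard against float drift.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    """Recover approximate floats from integer codes and their scale."""
    return [c * scale for c in codes]


def quantize_grouped(weights, group_size=4, bits=4):
    """Quantize a flat list of weights group by group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize_grouped(groups):
    """Concatenate the dequantized groups back into one flat list."""
    restored = []
    for codes, scale in groups:
        restored.extend(dequantize_group(codes, scale))
    return restored


if __name__ == "__main__":
    # Hypothetical weights: the second group has a much larger range.
    weights = [0.12, -0.40, 0.05, 0.33, 2.10, -1.75, 0.90, -0.02]
    groups = quantize_grouped(weights, group_size=4, bits=4)
    restored = dequantize_grouped(groups)
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"max abs error: {max_err:.4f}")
```

Because each group carries its own scale, the large values in the second group do not force a coarse step size onto the small values in the first group. That is the motivation for grouping.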

When to Use Quantization

Scenario                        Suitability
Running models on 4–8 GB GPUs   Highly suitable
CPU-only inference              Useful for memory reduction
Training                        Not recommended (quantization is for inference)
RAG or lightweight assistants   Appropriate

Environment Setup

Create an isolated environment for quantized inference:

cat <<'Eof' > setup_quant.sh
#!/usr/bin/env bash
set -euo pipefail

VENV_DIR=quant-venv
python3 -m venv ${VENV_DIR}
source ${VENV_DIR}/bin/activate

pip install --upgrade pip setuptools wheel
pip install transformers accelerate bitsandbytes safetensors

echo "Environment ready."
echo "Activate with: source ${VENV_DIR}/bin/activate"
Eof

Run the setup:

chmod +x setup_quant.sh
./setup_quant.sh

Quantized Loading Script

The script below loads a small LLM with 4-bit quantization using bitsandbytes. Adjust the model identifier based on available hardware.

cat <<'Eof' > run_quantized.py
#!/usr/bin/env python3

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# ===== USER SETTINGS =====
model_id = "your-small-llm"  # replace with a 2B–7B model supporting 4-bit load
prompt = "Summarize the concept of quantization in one short paragraph."
max_new_tokens = 120
# =========================

# Quantization configuration
quant = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inference
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Pass input_ids and attention_mask together so padding is handled correctly.
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)

print(tokenizer.decode(out[0], skip_special_tokens=True))
Eof

Run the Quantized Model

source quant-venv/bin/activate
python run_quantized.py

Expected Benefits

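  • Memory reduction: 4-bit weights cut weight memory by roughly 75% compared to fp16 (16 bits down to 4 bits per weight, before the small overhead of quantization constants). Activations and the KV cache are not reduced.
  • Feasible inference: models that normally require 12–16 GB VRAM may fit into 6–8 GB.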
  • No architecture changes: quantization occurs at load time.
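The arithmetic behind these figures can be sketched in a few lines. The helper names below are hypothetical, and the group size and scale width are illustrative assumptions rather than values taken from any particular library. The estimate covers weights only and ignores activations, the KV cache, and framework overhead.

```python
"""Back-of-the-envelope estimate of weight memory at different precisions."""


def weight_memory_gib(n_params, bits_per_weight):
    """Weight storage in GiB for a model with n_params parameters."""
    return n_params * bits_per_weight / 8 / 1024 ** 3


def effective_bits(bits=4, group_size=64, scale_bits=32):
    """Bits per weight once one scale per group is counted.

    group_size and scale_bits are assumptions chosen for illustration.
    """
    return bits + scale_bits / group_size


if __name__ == "__main__":
    n_params = 7_000_000_000  # a hypothetical 7B-parameter model
    for label, bits in [
        ("fp32", 32),
        ("fp16", 16),
        ("int8", 8),
        ("4-bit, ideal", 4),
        ("4-bit, with per-group scales", effective_bits()),
    ]:
        print(f"{label:30s} {weight_memory_gib(n_params, bits):6.2f} GiB")
```

The per-group scales explain why real savings land slightly below the ideal 75%. Double quantization reduces that overhead by quantizing the scales themselves.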

Evaluation Considerations

Quantization introduces small numerical deviations. Evaluate:

  • Answer fidelity for domain tasks
  • Latency and throughput
  • Hallucination behaviour compared to fp16 baseline
  • Stability for long outputs
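A small harness can make the latency and fidelity checks repeatable. The sketch below assumes each model is wrapped in a hypothetical generate_fn callable that takes a prompt and returns a list of tokens; the stand-in functions in the demo are placeholders, not real models. Exact-match agreement is a crude fidelity signal and is only meaningful with deterministic (greedy) decoding.

```python
"""Minimal harness for comparing a quantized model against a baseline."""
import time


def benchmark(generate_fn, prompts):
    """Run generate_fn on each prompt; report outputs, latency, throughput."""
    outputs, total_tokens = [], 0
    start = time.perf_counter()
    for prompt in prompts:
        tokens = generate_fn(prompt)
        outputs.append(tokens)
        total_tokens += len(tokens)
    elapsed = time.perf_counter() - start
    return {
        "outputs": outputs,
        "mean_latency_s": elapsed / len(prompts),
        "tokens_per_s": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


def agreement(baseline_outputs, quantized_outputs):
    """Fraction of prompts where both models produced identical outputs."""
    pairs = list(zip(baseline_outputs, quantized_outputs))
    return sum(a == b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Stand-ins for real models: replace with wrappers around model.generate.
    def fake_fp16(prompt):
        return prompt.split()

    def fake_4bit(prompt):
        return prompt.lower().split()

    prompts = ["Define quantization", "what is int8"]
    base = benchmark(fake_fp16, prompts)
    quant = benchmark(fake_4bit, prompts)
    print(f"agreement: {agreement(base['outputs'], quant['outputs']):.2f}")
    print(f"quantized tokens/s: {quant['tokens_per_s']:.0f}")
```

For domain tasks, a task-specific metric in place of exact match will usually say more about answer fidelity.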

Common Issues

  • Model not loading: some models do not ship with quantization-compatible layers.
  • OOM errors: reduce sequence length or use smaller models.
  • Slow inference: this usually indicates CPU fallback, often because the CUDA-enabled builds of torch or bitsandbytes were not installed correctly.
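Before debugging further, it can help to confirm what the environment actually provides. The check below is a sketch: check_environment is a hypothetical helper name, and it only reports whether each package is importable and whether PyTorch can see a CUDA device. It does not verify that the installed versions are compatible with each other.

```python
"""Quick environment check for the quantized-inference setup."""
import importlib.util


def check_environment():
    """Report which packages are importable and whether CUDA is visible."""
    report = {
        name: importlib.util.find_spec(name) is not None
        for name in ("torch", "transformers", "accelerate", "bitsandbytes")
    }
    report["cuda_available"] = False
    if report["torch"]:
        import torch  # imported lazily so the check runs even without torch

        report["cuda_available"] = bool(torch.cuda.is_available())
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If cuda_available is False on a machine with a GPU, the installed torch build is the first thing to check. When a model was loaded with device_map="auto", inspecting model.hf_device_map should also show whether any layers were placed on the CPU.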

Summary

Quantization is a practical method to run small LLMs on constrained hardware. It reduces memory requirements, maintains acceptable accuracy, and integrates cleanly with the transformers ecosystem. This workflow provides a reproducible baseline for deploying quantized models in local environments.
