Running Your First Local LLM on a 6–8 GB GPU: A Scientific Guide to Small Models

Synopsis: This guide describes practical, reproducible steps to run a compact language model (2B–7B parameter class) on a consumer GPU with ~6–8 GB VRAM. It focuses on minimal dependencies, quantization for memory reduction, and objective benchmarking so you get useful output while preserving reproducibility and safety.

Why this approach works

Large models (tens to hundreds of billions of parameters) require large memory and specialized hardware. Smaller models (2B–7B) combined with quantization (4-bit or 8-bit) and device mapping permit reasonable latency and task utility on 6–8 GB GPUs. The underlying scientific principles are:

  • Model scaling law tradeoffs: smaller models have less representational capacity but are still effective for many tasks when used with retrieval or fine-tuned heads.
  • Quantization: reduces the memory footprint by representing weights with fewer bits; well-tested quantization schemes preserve enough numeric fidelity for many inference tasks.
  • Device mapping + offloading: strategically placing tensors on GPU vs CPU extends effective usable model size.
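To make the quantization point concrete, the weight footprint can be approximated as parameters × bits per weight / 8 bytes. The sketch below is a back-of-the-envelope estimate only: it ignores activations, the KV cache, CUDA context overhead, and the small per-block metadata that quantized formats add. The function name is illustrative and not part of any library.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / (1024 ** 3)


if __name__ == "__main__":
    # Compare the parameter classes discussed in this guide at three precisions.
    for n_params in (2e9, 3e9, 7e9):
        row = ", ".join(
            f"{bits}-bit: {weight_footprint_gib(n_params, bits):5.2f} GiB"
            for bits in (16, 8, 4)
        )
        print(f"{n_params / 1e9:.0f}B params -> {row}")
```

Under these assumptions a 7B model needs roughly 13 GiB of weights at 16-bit but about 3.3 GiB at 4-bit, which is why 4-bit loading is what brings the 7B class within reach of a 6–8 GB card.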

Prerequisites

Requirement     Minimum
GPU             NVIDIA 6–8 GB (e.g., RTX 3060 6GB)
OS              Ubuntu 20.04 / 22.04 or similar
Disk            ~10–20 GB free (model weights & cache)
Python          3.10–3.11
Driver / CUDA   Matching NVIDIA driver and CUDA toolkit (check nvidia-smi)

High-level steps

  1. Verify GPU & system tooling.
  2. Create an isolated Python environment.
  3. Install a minimal inference stack (PyTorch, Transformers, accelerate, bitsandbytes).
  4. Choose a compact model (2B–7B) and use a 4-bit/8-bit quantized load path.
  5. Run a basic inference; measure latency and token throughput.

Step A — Quick checks (run in terminal)

# Check GPU presence and driver
nvidia-smi

# Check Python
python3 --version

# Make sure you have pip
pip3 --version

Note: If the nvidia-smi command is not found, the system either lacks NVIDIA drivers or the GPU is not recognized. Resolve drivers first; inference on GPU requires drivers and CUDA-compatible PyTorch.
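If you would rather script this check, the sketch below wraps nvidia-smi's CSV query mode and reports free VRAM per GPU. It assumes the driver reports plain integer MiB values for both memory fields; the helper names are made up for this example.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse 'name, total_mib, used_mib' lines as printed by the query above."""
    gpus = []
    for line in text.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, total, used = [field.strip() for field in line.rsplit(",", 2)]
        gpus.append({"name": name, "total_mib": int(total), "used_mib": int(used)})
    return gpus


def query_gpus() -> list[dict]:
    """Return GPU info, or an empty list if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except subprocess.CalledProcessError:
        return []
    return parse_gpu_csv(out)


if __name__ == "__main__":
    gpus = query_gpus()
    if not gpus:
        print("No GPU reported by nvidia-smi; check the NVIDIA driver installation")
    for gpu in gpus:
        free = gpu["total_mib"] - gpu["used_mib"]
        print(f"{gpu['name']}: {free} MiB free of {gpu['total_mib']} MiB")
```

The free figure matters more than the total: a desktop session or browser can already hold several hundred MiB of a 6 GB card.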

Step B — Automated setup script

Use the following script to create a venv and install a minimal set of packages for GPU inference. Copy-paste the block below into your Ubuntu terminal to create a file setup_llm.sh, make it executable, and run it.

cat <<'Eof' > setup_llm.sh
#!/usr/bin/env bash
set -euo pipefail
VENV_DIR=llm-venv

echo "Creating Python virtual environment in ./${VENV_DIR} ($(python3 --version))"
python3 -m venv "${VENV_DIR}"
source "${VENV_DIR}/bin/activate"

echo "Upgrading pip and installing core packages"
pip install --upgrade pip setuptools wheel

# Install PyTorch first so the libraries below reuse the torch build you chose.
# If you need a CUDA-specific wheel, replace this line with the official
# command from PyTorch's website for your CUDA version.
pip install "torch>=2.1.0"

# Install commonly used inference libraries.
pip install "transformers>=4.35.0" accelerate bitsandbytes safetensors sentencepiece

echo "Setup complete. Activate with: source ${VENV_DIR}/bin/activate"
Eof

If your system needs a CUDA-specific PyTorch binary, replace the pip install of torch with the official install command from the PyTorch website. The script above installs whatever wheel pip resolves by default, which on some systems may be CPU-only or built for a different CUDA version than your driver supports.
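A quick way to confirm which kind of wheel you ended up with is to ask torch directly. This is a minimal sketch to run inside the activated venv; the function name is illustrative, and it degrades gracefully when torch is not installed.

```python
def torch_cuda_report() -> dict:
    """Summarize whether the installed torch build can use the GPU."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_with_cuda": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
    return report


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If built_with_cuda prints None, the wheel is CPU-only. If it prints a version but cuda_available is False, the driver is the more likely culprit.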

Step C — Minimal inference script (4-bit / 8-bit load path)

Below is a minimal Python program that loads a compact model with quantization settings where supported, issues a short prompt, and prints results. Save it as run_llm.py in the same directory as the venv or where you will run it.

cat <<'Eof' > run_llm.py
#!/usr/bin/env python3
"""
Minimal inference example for a small LLM on constrained GPU.
This uses the huggingface transformers API and bitsandbytes quantization where available.
Adjust model_name to a quantized-capable small model (2B-7B). Ensure your venv is active.
"""
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

try:
    from transformers import BitsAndBytesConfig
except ImportError:  # older transformers releases may not ship this class
    BitsAndBytesConfig = None

# === USER CONFIG ===
model_name = "your-compact-model-identifier"  # replace with model repo id (HuggingFace-style)
prompt = "Summarize the clinical significance of elevated ALT in one short paragraph."
max_new_tokens = 120
device = "cuda" if torch.cuda.is_available() else "cpu"

# === Quantization / bnb config ===
# Attempt a 4-bit load only on a CUDA device. If the config cannot be built
# (for example, BitsAndBytesConfig is unavailable in your transformers version),
# fall back to a standard load.
bnb_config = None
if device == "cuda":
    try:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
        )
    except Exception as exc:
        print(f"4-bit config unavailable ({exc}); falling back to a standard load")
        bnb_config = None

print(f"Device: {device}")
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

print("Loading model (this may take time and memory)...")
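# float16 is intended for GPU execution; use float32 when running on CPU.
dtype = torch.float16 if device == "cuda" else torch.float32
load_kwargs = {"torch_dtype": dtype, "device_map": "auto"}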
if bnb_config is not None:
    load_kwargs["quantization_config"] = bnb_config

model = AutoModelForCausalLM.from_pretrained(model_name, **load_kwargs)

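print("Running inference...")
# Move inputs to wherever device_map placed the model's first layers.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
start = time.time()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
end = time.time()

# generate() returns prompt + continuation; count only the new tokens.
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("=== OUTPUT ===")
print(result)
print(f"\nInference time: {end - start:.2f} s ({new_tokens / (end - start):.1f} tokens/s)")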
Eof

Step D — Run the scripts

  1. Make scripts executable and run setup:
    chmod +x setup_llm.sh
    ./setup_llm.sh
  2. Activate the virtual environment and run inference (edit run_llm.py to set model_name first):
    source llm-venv/bin/activate
    python run_llm.py

Choosing a model (practical advice)

Prefer compact, actively maintained models with good community support and quantized checkpoints. Look for:

  • Model size 2B–7B parameters: best balance of capability vs memory.
  • Availability of quantized weights or compatibility with bitsandbytes / 4-bit load paths.
  • Clear license allowing local use.
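One way to sanity-check a candidate against these criteria is to budget weights plus KV cache against your VRAM before downloading anything. The sketch below is an estimate under stated assumptions: the architecture numbers are hypothetical placeholders (read the real ones from the candidate's config.json), the KV cache is stored in 16-bit, and the 1 GiB overhead allowance for CUDA context and activations is a guess rather than a measured value.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelSpec:
    # Hypothetical placeholder values go here; take real ones from config.json.
    n_params: float
    n_layers: int
    n_kv_heads: int
    head_dim: int


def kv_cache_gib(spec: ModelSpec, seq_len: int, bytes_per_value: int = 2) -> float:
    """KV cache for one sequence: a K and a V tensor per layer, per token."""
    per_token = 2 * spec.n_layers * spec.n_kv_heads * spec.head_dim * bytes_per_value
    return per_token * seq_len / GIB


def fits(spec: ModelSpec, bits: int, seq_len: int, vram_gib: float,
         overhead_gib: float = 1.0) -> bool:
    """True if weights + KV cache + assumed overhead fit in the VRAM budget."""
    weights_gib = spec.n_params * bits / 8 / GIB
    return weights_gib + kv_cache_gib(spec, seq_len) + overhead_gib <= vram_gib


if __name__ == "__main__":
    candidate = ModelSpec(n_params=7e9, n_layers=32, n_kv_heads=32, head_dim=128)
    for bits, vram in ((4, 6.0), (8, 8.0)):
        verdict = "fits" if fits(candidate, bits, seq_len=2048, vram_gib=vram) else "does not fit"
        print(f"{bits}-bit on {vram:.0f} GiB: {verdict}")
```

With these placeholder numbers, the 4-bit variant fits a 6 GiB budget at a 2048-token context (about 3.3 GiB of weights plus 1 GiB of cache), while the 8-bit variant overshoots even 8 GiB. Models with fewer KV heads than attention heads need proportionally less cache.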

Benchmarking & evaluation

Measure two objective metrics after you run an inference:

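Metric        How to measure
Latency (s)   Record wall-clock time around model.generate() (for example with time.time()). This gives total generation latency; to approximate time to first token, repeat the measurement with max_new_tokens=1.
Tokens/sec    Divide the number of newly generated tokens (excluding the prompt) by the generation time; this enables comparison across models.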
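A single timed run is noisy, especially the first one, which also pays for CUDA initialization. The sketch below shows one possible harness: it discards warm-up runs and reports medians. The benchmark function and its generate_fn callback are illustrative names; a stand-in callable replaces the model here so the example runs without a GPU.

```python
import statistics
import time
from typing import Callable


def benchmark(generate_fn: Callable[[], int], runs: int = 3, warmup: int = 1) -> dict:
    """Time generate_fn, which must perform one generation and return
    the number of newly generated tokens."""
    for _ in range(warmup):
        generate_fn()  # warm-up results are discarded
    latencies, rates = [], []
    for _ in range(runs):
        start = time.perf_counter()
        new_tokens = generate_fn()
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        rates.append(new_tokens / elapsed)
    return {
        "median_latency_s": statistics.median(latencies),
        "median_tokens_per_s": statistics.median(rates),
    }


if __name__ == "__main__":
    def fake_generate() -> int:
        time.sleep(0.05)  # stand-in for a real model.generate() call
        return 20

    print(benchmark(fake_generate))
```

With a real model, generate_fn would wrap model.generate() and return the output length minus the prompt length. On GPU, calling torch.cuda.synchronize() before reading the clock is a common extra precaution.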

Common pitfalls & troubleshooting

  • Out of memory: reduce max_new_tokens, use device_map="auto" so tensors can offload to CPU, or switch to a smaller model.
  • Incorrect torch wheel: if PyTorch is CPU-only, GPU inference will not work. Install the appropriate CUDA-enabled torch wheel for your CUDA version.
  • Tokenizer mismatch: load the tokenizer from the same repository id as the model so that vocabulary and special tokens agree.

Tip: If inference is still memory constrained, consider a CPU-based pipeline using an ONNX-exported model and an optimized runtime. This trades latency for feasibility on very small GPUs.
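For the out-of-memory case, device_map="auto" can be paired with a max_memory mapping that caps how much of the GPU the loader may fill, leaving room for the KV cache and activations. The helper below is a hypothetical convenience function, not a library API; the 1.5 GiB headroom and 16 GiB CPU budget are assumed defaults to adjust for your machine.

```python
def build_max_memory(gpu_total_gib: float, headroom_gib: float = 1.5,
                     cpu_gib: int = 16) -> dict:
    """Build a max_memory mapping that caps GPU 0 below its physical size."""
    gpu_budget = max(int(gpu_total_gib - headroom_gib), 1)
    return {0: f"{gpu_budget}GiB", "cpu": f"{cpu_gib}GiB"}


if __name__ == "__main__":
    max_memory = build_max_memory(gpu_total_gib=6)
    print(max_memory)
    # Passed to the loader roughly as follows (not executed here):
    # model = AutoModelForCausalLM.from_pretrained(
    #     model_name, device_map="auto", max_memory=max_memory)
```

Layers that do not fit under the GPU cap are placed on the CPU, which keeps the load from failing at the cost of slower generation. Depending on library versions, CPU offload may be restricted when bitsandbytes quantization is active, so test this combination before relying on it.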

Safety, reproducibility, and notes for clinicians

When using LLMs in clinical contexts, treat outputs as assistive rather than authoritative. Always validate model responses against primary sources. Track the exact model identifier, quantization state, and package versions used for any experiment; this ensures reproducibility and auditability.
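The tracking step can be automated with the standard library alone. The sketch below collects the items named above into a small manifest; the package list mirrors the setup script, while the function name and the "bnb-4bit" label are illustrative choices.

```python
import json
import platform
from importlib import metadata

PACKAGES = ("torch", "transformers", "accelerate", "bitsandbytes", "safetensors")


def build_manifest(model_name: str, quantization: str) -> dict:
    """Collect the model id, quantization state, and package versions."""
    versions = {}
    for pkg in PACKAGES:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = "not installed"
    return {
        "model_name": model_name,
        "quantization": quantization,
        "python": platform.python_version(),
        "packages": versions,
    }


if __name__ == "__main__":
    manifest = build_manifest("your-compact-model-identifier", "bnb-4bit")
    print(json.dumps(manifest, indent=2))
```

Saving this JSON next to each experiment's outputs gives you an audit trail you can diff when results change between runs.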

Short reference list (authoritative projects)

  • Hugging Face — model hub & Transformers library
  • bitsandbytes — memory-efficient quantized inference
  • PyTorch — primary deep learning runtime
  • ONNX & ONNX Runtime — optimized CPU inference pathway
  • GGML / llama.cpp — CPU-first lightweight inference projects

Summary

This workflow gives a pragmatic path: confirm GPU basics, prepare an isolated Python environment, install a minimal inference stack, choose a compact model, and use quantized loading plus device mapping to run credible local inference on 6–8 GB GPUs. The process emphasizes measurable metrics (latency, tokens/sec) and reproducibility.
