Building a Lightweight Local RAG System: A Practical Workflow

This article outlines a reproducible method to build a simple retrieval-augmented generation (RAG) system on a constrained machine. The goal is to combine compact embeddings, a minimal vector index, and a small quantized language model to create a functional question–answer pipeline.

Objective

Create a local RAG setup that runs efficiently on CPU or on a small GPU (6–8 GB), with predictable latency and no external services. The workflow avoids large dependencies and focuses on core components only.

System Requirements

Component	Minimum
CPU	Any modern laptop
GPU (optional)	6–8 GB VRAM
Python	3.10 or 3.11
Disk	2–3 GB free

Architecture Overview

Embedding Model: small CPU-friendly model for document vectorization
Index: lightweight FAISS or SQLite-based store
LLM: 4-bit quantized model for question answering
Pipeline: retrieve → format → generate

Environment Setup

cat <<'Eof' > setup_rag.sh
#!/usr/bin/env bash
set -euo pipefail

VENV_DIR=rag-venv
python3 -m venv "${VENV_DIR}"
source "${VENV_DIR}/bin/activate"

pip install --upgrade pip setuptools wheel

pip install faiss-cpu numpy pandas sentence-transformers \
            transformers accelerate bitsandbytes safetensors

echo "Environment ready."
echo "Activate with: source ${VENV_DIR}/bin/activate"
Eof

Run the setup:

chmod +x setup_rag.sh
./setup_rag.sh

RAG Script

The script below loads documents, embeds them, constructs a FAISS index, retrieves the top-k passages for a query, and uses a small LLM for generation. Adapt paths and model identifiers as required.

cat <<'Eof' > rag_pipeline.py
#!/usr/bin/env python3

import numpy as np
import faiss
import torch

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# ===== USER SETTINGS =====
docs_path = "documents.txt"       # one paragraph per line
embedding_model_id = "sentence-transformers/all-MiniLM-L6-v2"
llm_model_id = "your-small-llm"   # replace with a 2B–7B quantizable model
top_k = 3
# =========================

# Load documents
with open(docs_path, "r", encoding="utf-8") as f:
    docs = [line.strip() for line in f if line.strip()]
if not docs:
    raise SystemExit(f"No documents found in {docs_path}")

# Embedding model
embed_model = SentenceTransformer(embedding_model_id)
embeddings = embed_model.encode(docs, convert_to_numpy=True)

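# Build FAISS index (exact L2 search; FAISS expects contiguous float32 arrays)
embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)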

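# LLM setup: try 4-bit quantization first (bitsandbytes typically needs a
# CUDA GPU), then fall back to unquantized weights if that fails.
try:
    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    llm = AutoModelForCausalLM.from_pretrained(
        llm_model_id,
        quantization_config=bnb,
        device_map="auto"
    )
except Exception as exc:
    print(f"4-bit load failed ({exc}); loading unquantized weights instead.")
    # float16 halves memory on a GPU; float32 is the safer choice on CPU
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    llm = AutoModelForCausalLM.from_pretrained(
        llm_model_id,
        torch_dtype=dtype,
        device_map="auto"
    )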

tokenizer = AutoTokenizer.from_pretrained(llm_model_id)

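def answer(query):
    # Encode the query and retrieve the nearest passages
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    k = min(top_k, len(docs))  # FAISS pads results with -1 if k exceeds the index size
    _, idx = index.search(q_emb, k)
    retrieved = "\n".join(docs[i] for i in idx[0] if i >= 0)

    prompt = f"Context:\n{retrieved}\n\nQuestion: {query}\nAnswer:"

    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    with torch.no_grad():
        # Unpack inputs so the attention mask is passed with the input IDs
        out = llm.generate(**inputs, max_new_tokens=150)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()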

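# Demo
if __name__ == "__main__":
    print("RAG system loaded. Enter a query (Ctrl-D or 'exit' to quit).")
    while True:
        try:
            q = input("> ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if q.lower() in {"exit", "quit"}:
            break
        if not q:
            continue
        print(answer(q))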
Eof

Run the RAG System

source rag-venv/bin/activate
python rag_pipeline.py

Document Format

Create a simple file named documents.txt with one paragraph per line. Concise text improves retrieval quality. For larger collections, pre-process documents into smaller chunks (150–250 words).
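
The chunking step could be sketched as follows. This is a minimal illustration, not part of the pipeline script: the function name chunk_words, the 200-word default, and the 20-word overlap are assumptions chosen to fall inside the 150–250 word range suggested above.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()  # split() also collapses newlines and repeated spaces
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(450))
    for chunk in chunk_words(sample):
        print(len(chunk.split()), chunk[:30])
```

Because split() removes newlines, each chunk is a single line and can be written directly to documents.txt, one chunk per line.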

Performance Notes

Embedding throughput: MiniLM-based embeddings run efficiently on CPU.
Index size: IndexFlatL2 keeps every vector uncompressed in RAM (384 float32 values, about 1.5 KB, per passage with all-MiniLM-L6-v2), so thousands of documents fit comfortably on a laptop.
LLM latency: quantized 2B–4B models provide manageable inference times.
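
As a rough check on the index-size note, the arithmetic for a flat (uncompressed) index can be written out. The helper name flat_index_bytes is hypothetical; the defaults assume 384-dimensional float32 vectors, which is what all-MiniLM-L6-v2 produces, and the figure covers the raw vectors only, not the document text or Python overhead.

```python
def flat_index_bytes(num_vectors, dim=384, bytes_per_value=4):
    """Raw vector storage for a flat index: one float32 per dimension."""
    return num_vectors * dim * bytes_per_value


if __name__ == "__main__":
    for n in (1_000, 100_000, 1_000_000):
        mib = flat_index_bytes(n) / 2**20
        print(f"{n:>9,} vectors: {mib:8.1f} MiB")
```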

Evaluation

RAG quality depends on retrieval accuracy. To assess performance:

Review top-k results manually for relevance
Measure recall@k if labelled data is available
Track failure cases: missing context, over-broad answers, hallucination risks
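
Where labelled data exists, recall@k can be computed along the following lines. This is a sketch under assumptions: the function names are illustrative, and each labelled example is assumed to be a pair of (retrieved document indices in rank order, indices known to be relevant).

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(runs, k):
    """Average recall@k over (retrieved_ids, relevant_ids) pairs."""
    if not runs:
        raise ValueError("runs must not be empty")
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)


if __name__ == "__main__":
    labelled = [
        ([4, 1, 7], [1, 9]),  # one of two relevant passages retrieved
        ([2, 3, 5], [2]),     # the single relevant passage retrieved
    ]
    print(mean_recall_at_k(labelled, k=3))
```

With the pipeline script, the retrieved ids for a query would be the row idx[0] returned by index.search.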

Common Issues

High latency: reduce max_new_tokens or switch to a smaller LLM.
Low retrieval relevance: split documents into smaller chunks.
RAM pressure: prefer smaller embedding models or compress vectors (PCA).
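
The PCA option could look like the sketch below, written with plain NumPy. The names fit_pca and apply_pca, the target of 128 dimensions, and the random stand-in data are assumptions for illustration; with real data the projection would be fitted on the document embeddings and the effect on retrieval quality checked before adopting it.

```python
import numpy as np


def fit_pca(vectors, out_dim):
    """Return (mean, components) for projecting vectors down to out_dim."""
    mean = vectors.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:out_dim].T


def apply_pca(vectors, mean, components):
    """Center and project vectors; float32 keeps the result FAISS-compatible."""
    return ((vectors - mean) @ components).astype(np.float32)


# Random stand-in for real document embeddings (1000 passages, 384 dims)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 384)).astype(np.float32)

mean, components = fit_pca(embeddings, out_dim=128)
reduced = apply_pca(embeddings, mean, components)
print(reduced.shape, reduced.nbytes / embeddings.nbytes)
```

Query embeddings must be projected with the same mean and components before searching, and the index must be built with the reduced dimension.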

Summary

This workflow provides a functional, reproducible local RAG system using small embeddings, a minimal index, and a compact LLM. It is suitable for constrained hardware and supports transparent evaluation and iterative improvement.
