Building a Lightweight Local RAG System: A Practical Workflow

This article outlines a reproducible method to build a simple retrieval-augmented generation (RAG) system on a constrained machine. The goal is to combine compact embeddings, a minimal vector index, and a small quantized language model to create a functional question–answer pipeline.

Objective

Create a local RAG setup that runs efficiently on CPU or on a small GPU (6–8 GB), with predictable latency and no external services. The workflow avoids large dependencies and focuses on core components only.

System Requirements

Component        Minimum
CPU              Any modern laptop
GPU (optional)   6–8 GB VRAM
Python           3.10 or 3.11
Disk             2–3 GB free

Architecture Overview

  • Embedding Model: small CPU-friendly model for document vectorization
  • Index: lightweight FAISS or SQLite-based store
  • LLM: 4-bit quantized model for question answering
  • Pipeline: retrieve → format → generate
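
The three pipeline stages can be pictured as plain functions that feed one another. The sketch below only illustrates the data flow: the word-overlap retriever and the placeholder generator are stand-ins invented for this example, not the embedding search or LLM call used in the real script.

```python
def retrieve(query, docs, k=2):
    """Stand-in retriever: rank documents by how many lowercase words they share."""
    q_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def format_prompt(query, passages):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt):
    """Stand-in generator: a real system would call the LLM here."""
    return f"[model output for a {len(prompt)}-character prompt]"


def rag(query, docs):
    # retrieve -> format -> generate
    return generate(format_prompt(query, retrieve(query, docs)))


if __name__ == "__main__":
    sample_docs = [
        "FAISS stores vectors in RAM",
        "MiniLM runs on CPU",
        "Bananas are yellow",
    ]
    print(rag("where does FAISS keep vectors", sample_docs))
```

Keeping the stages separate makes it easy to swap in a different retriever or model later without touching the rest.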

Environment Setup

cat > setup_rag.sh <<'Eof'
#!/usr/bin/env bash
set -euo pipefail

VENV_DIR=rag-venv
python3 -m venv "${VENV_DIR}"
source "${VENV_DIR}/bin/activate"

pip install --upgrade pip setuptools wheel

pip install faiss-cpu numpy pandas sentence-transformers \
            transformers accelerate bitsandbytes safetensors

echo "Environment ready."
echo "Activate with: source ${VENV_DIR}/bin/activate"
Eof

Run the setup:

chmod +x setup_rag.sh
./setup_rag.sh
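
Before moving on, it can help to confirm that the environment actually contains the modules the pipeline imports. The following is a small optional check, run inside the activated virtual environment; the function name `missing_modules` is made up for this sketch.

```python
import importlib.util

# Third-party modules the pipeline relies on (import names, not pip names).
REQUIRED = ["faiss", "numpy", "torch", "sentence_transformers", "transformers"]


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

An empty result only shows that the packages are installed; it does not verify that a GPU is usable for 4-bit loading.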

RAG Script

The script below loads documents, embeds them, constructs a FAISS index, retrieves the top-k passages for a query, and uses a small LLM for generation. Adapt paths and model identifiers as required.

cat > rag_pipeline.py <<'Eof'
#!/usr/bin/env python3

import numpy as np
import faiss
import torch

from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# ===== USER SETTINGS =====
docs_path = "documents.txt"       # one paragraph per line
embedding_model_id = "sentence-transformers/all-MiniLM-L6-v2"
llm_model_id = "your-small-llm"   # replace with a 2B–7B quantizable model
top_k = 3
# =========================

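# Load documents (one passage per line; blank lines are skipped)
with open(docs_path, "r", encoding="utf-8") as f:
    docs = [line.strip() for line in f if line.strip()]
if not docs:
    raise SystemExit(f"No passages found in {docs_path}")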

# Embedding model
embed_model = SentenceTransformer(embedding_model_id)
embeddings = embed_model.encode(docs, convert_to_numpy=True)

# Build FAISS index
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)

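# LLM setup: try 4-bit loading first (bitsandbytes typically requires a CUDA GPU)
try:
    bnb = BitsAndBytesConfig(load_in_4bit=True)
    llm = AutoModelForCausalLM.from_pretrained(
        llm_model_id,
        quantization_config=bnb,
        device_map="auto"
    )
except Exception as exc:
    # Fallback without quantization: half precision on GPU, full precision on CPU.
    # This needs far more memory than the 4-bit path.
    print(f"4-bit load failed ({exc}); loading unquantized weights.")
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    llm = AutoModelForCausalLM.from_pretrained(
        llm_model_id,
        torch_dtype=dtype,
        device_map="auto"
    )

tokenizer = AutoTokenizer.from_pretrained(llm_model_id)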

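def answer(query):
    # Embed the query and fetch the nearest passages (never more than we have)
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    _, idx = index.search(q_emb, min(top_k, len(docs)))
    # FAISS pads missing results with -1; skip those
    retrieved = "\n".join(docs[i] for i in idx[0] if i >= 0)

    prompt = f"Context:\n{retrieved}\n\nQuestion: {query}\nAnswer:"

    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    with torch.no_grad():
        # Pass the attention mask together with the input ids
        out = llm.generate(**inputs, max_new_tokens=150)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()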

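# Demo
if __name__ == "__main__":
    print("RAG system loaded. Enter a query ('exit' or Ctrl-D to quit).")
    while True:
        try:
            q = input("> ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if q.lower() in {"exit", "quit"}:
            break
        if not q:
            continue
        print(answer(q))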
Eof

Run the RAG System

source rag-venv/bin/activate
python rag_pipeline.py
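
To make the retrieval step less of a black box, here is roughly what an exact (flat) L2 search computes, written in plain NumPy. It is an illustrative equivalent, not FAISS code; the helper name `l2_top_k` and the toy vectors are assumptions for the example. Like `IndexFlatL2`, it reports squared distances.

```python
import numpy as np


def l2_top_k(doc_vecs, query_vec, k):
    """Return (indices, squared L2 distances) of the k nearest rows, closest first."""
    diffs = doc_vecs - query_vec                  # broadcast the query over all rows
    dists = np.einsum("ij,ij->i", diffs, diffs)   # squared distance per row
    k = min(k, len(doc_vecs))                     # never ask for more than we have
    order = np.argsort(dists)[:k]
    return order, dists[order]


# Toy 2-D "embeddings" standing in for real document vectors.
docs = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]], dtype=np.float32)
idx, dist = l2_top_k(docs, np.array([0.9, 0.1], dtype=np.float32), k=2)
print(idx, dist)
```

A brute-force scan like this is linear in the number of passages, which is why a flat index is a reasonable choice for small collections.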

Document Format

Create a simple file named documents.txt with one paragraph per line. Concise text improves retrieval quality. For larger collections, pre-process documents into smaller chunks (150–250 words).
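
One simple way to produce such chunks is to split on whitespace and cut every fixed number of words. The sketch below assumes a 200-word limit and a 20-word overlap between neighbouring chunks; the overlap is an addition of this example rather than something the workflow requires, and `chunk_words` is a hypothetical helper name.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words that overlap at the edges."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    long_text = " ".join(f"w{i}" for i in range(450))
    for chunk in chunk_words(long_text):
        print(len(chunk.split()), chunk[:30])
```

Because the chunks are rebuilt from whitespace-split words, they contain no line breaks, so writing them out with `"\n".join(chunks)` yields the one-passage-per-line layout that documents.txt expects.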

Performance Notes

  • Embedding throughput: MiniLM-based embeddings run efficiently on CPU.
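  • Index size: a flat FAISS index keeps every vector in RAM as float32. With 384-dimensional MiniLM embeddings that is about 1.5 KB per passage, so thousands of passages fit comfortably on a laptop.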
  • LLM latency: quantized 2B–4B models provide manageable inference times.
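
The index-size point can be checked with simple arithmetic. The estimate below assumes a flat float32 index and the 384-dimensional output of all-MiniLM-L6-v2; it ignores the memory used by the document text and by the models themselves.

```python
def flat_index_bytes(num_vectors, dim=384, bytes_per_value=4):
    """Approximate RAM used by the raw vectors of a flat float32 index."""
    return num_vectors * dim * bytes_per_value


for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} vectors: {flat_index_bytes(n) / 1e6:.1f} MB")
```

Even at 100,000 passages the vectors stay in the low hundreds of megabytes, so on constrained machines the LLM weights, not the index, are usually the larger memory cost.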

Evaluation

RAG quality depends on retrieval accuracy. To assess performance:

  • Review top-k results manually for relevance
  • Measure recall@k if labelled data is available
  • Track failure cases: missing context, over-broad answers, hallucination risks
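
If a small labelled set is available, recall@k takes only a few lines to compute. The sketch assumes that each query comes with a list of retrieved passage ids (best first) and a set of ids judged relevant; the data layout and the name `recall_at_k` are choices made for this example.

```python
def recall_at_k(retrieved, relevant, k):
    """Mean fraction of each query's relevant ids that appear in its top-k results."""
    scores = []
    for got, want in zip(retrieved, relevant):
        if not want:
            continue  # skip queries that have no labelled relevant passages
        hits = len(set(got[:k]) & set(want))
        scores.append(hits / len(want))
    return sum(scores) / len(scores) if scores else 0.0


# Two queries: the first finds its only relevant passage, the second finds 1 of 2.
retrieved = [[2, 0, 5], [1, 3, 4]]
relevant = [{0}, {4, 7}]
print(recall_at_k(retrieved, relevant, k=3))  # 0.75
```

Comparing this number across values of k, or before and after re-chunking the documents, gives a quick signal of whether a change helped retrieval.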

Common Issues

  • High latency: reduce max_new_tokens or switch to a smaller LLM.
  • Low retrieval relevance: split documents into smaller chunks.
  • RAM pressure: prefer smaller embedding models or compress vectors (PCA).
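
As an illustration of the PCA option, the vectors can be projected onto their leading principal directions with NumPy alone. This is a minimal sketch on random stand-in data; the target size of 64 dimensions and the helper names are assumptions, and the effect on retrieval quality should be measured rather than taken for granted.

```python
import numpy as np


def fit_pca(vectors, out_dim):
    """Return (mean, components) for projecting vectors down to out_dim dimensions."""
    mean = vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(vectors - mean, full_matrices=False)
    return mean, vt[:out_dim].T


def apply_pca(vectors, mean, components):
    """Project vectors with a previously fitted mean and component matrix."""
    return ((vectors - mean) @ components).astype(np.float32)


rng = np.random.default_rng(0)
embeddings = rng.standard_normal((500, 384)).astype(np.float32)  # stand-in data
mean, components = fit_pca(embeddings, out_dim=64)
reduced = apply_pca(embeddings, mean, components)
print(embeddings.shape, "->", reduced.shape)
```

Query embeddings must be passed through `apply_pca` with the same `mean` and `components` before searching, otherwise queries and documents live in different spaces.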

Summary

This workflow provides a functional, reproducible local RAG system using small embeddings, a minimal index, and a compact LLM. It is suitable for constrained hardware and supports transparent evaluation and iterative improvement.
