Building a Lightweight Local RAG System: A Practical Workflow
This article outlines a reproducible method to build a simple retrieval-augmented generation (RAG) system on a constrained machine. The goal is to combine compact embeddings, a minimal vector index, and a small quantized language model to create a functional question–answer pipeline.
Objective
Create a local RAG setup that runs efficiently on CPU or on a small GPU (6–8 GB), with predictable latency and no external services. The workflow avoids large dependencies and focuses on core components only.
System Requirements
| Component | Minimum |
|---|---|
| CPU | Any modern laptop |
| GPU (optional) | 6–8 GB VRAM |
| Python | 3.10 or 3.11 |
| Disk | 2–3 GB free |
Architecture Overview
- Embedding Model: small CPU-friendly model for document vectorization
- Index: lightweight FAISS or SQLite-based store
- LLM: 4-bit quantized model for question answering
- Pipeline: retrieve → format → generate
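The retrieve → format → generate flow can be sketched without any model at all. The example below is a minimal illustration, not the article's script: `run_pipeline`, `toy_retrieve`, `toy_generate`, and `CORPUS` are hypothetical names, retrieval is replaced by simple word overlap, and generation is a placeholder. Only the prompt template matches the one used in the full script.

```python
from typing import Callable, List


def format_prompt(query: str, passages: List[str]) -> str:
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def run_pipeline(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Run the three stages in order: retrieve, format, generate."""
    passages = retrieve(query, top_k)
    prompt = format_prompt(query, passages)
    return generate(prompt)


# Toy stand-ins (assumptions for illustration only).
CORPUS = [
    "FAISS keeps vectors in RAM.",
    "MiniLM embeddings run on CPU.",
    "Quantized models need less memory.",
]


def toy_retrieve(query: str, top_k: int) -> List[str]:
    """Rank passages by how many lowercase words they share with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda p: len(q_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def toy_generate(prompt: str) -> str:
    """Placeholder for the LLM call; reports the prompt size only."""
    return f"[model output for a prompt of {len(prompt)} characters]"


print(run_pipeline("where does faiss keep vectors", toy_retrieve, toy_generate, top_k=1))
```

Keeping the stages behind plain callables makes it possible to swap the embedding model, the index, or the LLM independently while testing the rest.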
Environment Setup
cat <<'Eof' > setup_rag.sh
#!/usr/bin/env bash
set -euo pipefail
VENV_DIR=rag-venv
python3 -m venv ${VENV_DIR}
source ${VENV_DIR}/bin/activate
pip install --upgrade pip setuptools wheel
pip install faiss-cpu numpy pandas sentence-transformers \
transformers accelerate bitsandbytes safetensors
echo "Environment ready."
echo "Activate with: source ${VENV_DIR}/bin/activate"
Eof
Run the setup:
chmod +x setup_rag.sh
./setup_rag.sh
RAG Script
The script below loads documents, embeds them, constructs a FAISS index, retrieves the top-k passages for a query, and uses a small LLM for generation. Adapt paths and model identifiers as required.
cat <<'Eof' > rag_pipeline.py
#!/usr/bin/env python3
import os
import numpy as np
import pandas as pd
import faiss
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# ===== USER SETTINGS =====
docs_path = "documents.txt" # one paragraph per line
embedding_model_id = "sentence-transformers/all-MiniLM-L6-v2"
llm_model_id = "your-small-llm" # replace with a 2B–7B quantizable model
top_k = 3
# =========================
# Load documents
with open(docs_path, "r", encoding="utf-8") as f:
    docs = [line.strip() for line in f if line.strip()]
# Embedding model
embed_model = SentenceTransformer(embedding_model_id)
embeddings = embed_model.encode(docs, convert_to_numpy=True)
# Build FAISS index
dim = embeddings.shape[1]
index = faiss.IndexFlatL2(dim)
index.add(embeddings)
# LLM setup
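# Try 4-bit loading first (bitsandbytes typically requires a CUDA GPU);
# fall back to an unquantized load if that fails.
try:
    bnb = BitsAndBytesConfig(load_in_4bit=True)
    llm = AutoModelForCausalLM.from_pretrained(
        llm_model_id,
        quantization_config=bnb,
        device_map="auto"
    )
except Exception:
    # float16 is poorly supported on CPU, so use float32 there
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32
    llm = AutoModelForCausalLM.from_pretrained(
        llm_model_id,
        torch_dtype=dtype,
        device_map="auto"
    )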
tokenizer = AutoTokenizer.from_pretrained(llm_model_id)
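def answer(query):
    # Encode the query and fetch the nearest passages
    q_emb = embed_model.encode([query], convert_to_numpy=True)
    # Cap k at the corpus size so FAISS never returns -1 placeholders
    _, idx = index.search(q_emb, min(top_k, len(docs)))
    retrieved = "\n".join(docs[i] for i in idx[0])
    prompt = f"Context:\n{retrieved}\n\nQuestion: {query}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    with torch.no_grad():
        # Pass the attention mask along with the input IDs
        out = llm.generate(**inputs, max_new_tokens=150)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()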
# Demo
if __name__ == "__main__":
    print("RAG system loaded. Enter a query (Ctrl-D or Ctrl-C to exit).")
    while True:
        try:
            q = input("> ").strip()
        except (EOFError, KeyboardInterrupt):
            break
        if not q:
            continue
        print(answer(q))
Eof
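To make the retrieval step less opaque: a flat L2 index performs an exhaustive search, comparing the query against every stored vector and returning the closest ones by squared Euclidean distance. The pure-Python sketch below illustrates that idea on toy 2-D vectors. `squared_l2` and `flat_search` are hypothetical helper names, not FAISS functions, and the real library does the same work far faster in optimized native code.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(
    vectors: Sequence[Sequence[float]],
    query: Sequence[float],
    top_k: int,
) -> List[Tuple[float, int]]:
    """Score every vector against the query and keep the top_k closest.

    Returns (distance, position) pairs, nearest first.
    """
    scored = [(squared_l2(v, query), i) for i, v in enumerate(vectors)]
    scored.sort()
    return scored[:top_k]


# Toy data: four 2-D "embeddings" and one query.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
for distance, position in flat_search(vectors, [0.9, 0.1], top_k=2):
    print(f"vector {position}: squared distance {distance:.2f}")
```

Because every query touches every vector, search cost grows linearly with the collection size, which is acceptable for the thousands of documents this workflow targets.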
Run the RAG System
source rag-venv/bin/activate
python rag_pipeline.py
Document Format
Create a simple file named documents.txt with one paragraph per line. Concise text improves retrieval quality. For larger collections, pre-process documents into smaller chunks (150–250 words).
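One possible way to produce such chunks is a plain word-count splitter. The sketch below is an assumption about how the pre-processing could be done, not part of the original workflow: `chunk_words` is a hypothetical helper, the 200-word default simply sits inside the 150–250 range, and the optional `overlap` parameter is an extra that repeats a few words between neighbouring chunks.

```python
from typing import List


def chunk_words(text: str, max_words: int = 200, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most max_words words.

    Each chunk is returned as a single line, so it can be written
    directly to a one-paragraph-per-line file such as documents.txt.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(450))
print([len(c.split()) for c in chunk_words(sample)])
```

Splitting on word counts ignores sentence boundaries; splitting on sentences or headings first and then merging up to the word limit may give cleaner passages.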
Performance Notes
- Embedding throughput: MiniLM-based embeddings run efficiently on CPU.
- Index size: FAISS stores vectors in RAM; suitable for thousands of documents on a laptop.
- LLM latency: quantized 2B–4B models provide manageable inference times.
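The index-size note can be checked with simple arithmetic. A flat index stores each vector as raw 32-bit floats, and all-MiniLM-L6-v2 produces 384-dimensional embeddings, so memory is roughly the number of vectors × 384 × 4 bytes, or about 1.5 MiB per 1,000 passages. The helper name `flat_index_bytes` below is hypothetical, and the estimate ignores the document text itself and any library overhead.

```python
def flat_index_bytes(num_vectors: int, dim: int = 384, bytes_per_value: int = 4) -> int:
    """Approximate RAM used by the raw vectors of a flat float32 index."""
    return num_vectors * dim * bytes_per_value


for n in (1_000, 10_000, 100_000):
    mib = flat_index_bytes(n) / (1024 ** 2)
    print(f"{n:>7} vectors: {mib:.1f} MiB")
```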
Evaluation
RAG quality depends on retrieval accuracy. To assess performance:
- Review top-k results manually for relevance
- Measure recall@k if labelled data is available
- Track failure cases: missing context, over-broad answers, hallucination risks
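For the recall@k measurement, a small helper is usually enough. The sketch below assumes labelled data takes the form of a set of relevant document positions per query; `recall_at_k` and `mean_recall_at_k` are hypothetical names, and the IDs in the example are made up for illustration.

```python
from typing import Iterable, List, Sequence, Set


def recall_at_k(retrieved: Sequence[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        raise ValueError("relevant must contain at least one document ID")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    all_retrieved: Iterable[Sequence[int]],
    all_relevant: Iterable[Set[int]],
    k: int,
) -> float:
    """Average recall@k over a set of labelled queries."""
    scores: List[float] = [
        recall_at_k(r, rel, k) for r, rel in zip(all_retrieved, all_relevant)
    ]
    return sum(scores) / len(scores)


# Made-up example: ranked result IDs for two queries and their labels.
ranked = [[3, 7, 1, 9], [4, 2, 8, 0]]
labels = [{7, 9}, {2}]
print(mean_recall_at_k(ranked, labels, k=2))
```

In the pipeline above, the ranked IDs would come from the `idx[0]` array returned by `index.search`.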
Common Issues
- High latency: reduce max_new_tokens or switch to a smaller LLM.
- Low retrieval relevance: split documents into smaller chunks.
- RAM pressure: prefer smaller embedding models or compress vectors (PCA).
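As an illustration of the PCA option, the sketch below reduces 384-dimensional vectors to 64 dimensions using NumPy's SVD. It is a minimal example under stated assumptions: `fit_pca` and `apply_pca` are hypothetical helpers, the random data stands in for real embeddings, and the target of 64 dimensions is arbitrary. Retrieval quality after reduction should be re-checked on real queries.

```python
import numpy as np


def fit_pca(vectors: np.ndarray, n_components: int):
    """Learn the mean and the top principal directions of the vectors."""
    data = np.asarray(vectors, dtype=np.float64)
    mean = data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(vectors: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project vectors onto the learned directions and return float32."""
    data = np.asarray(vectors, dtype=np.float64)
    return ((data - mean) @ components.T).astype(np.float32)


# Random stand-in for document embeddings (assumption for the demo).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(200, 384)).astype(np.float32)

mean, components = fit_pca(vectors, n_components=64)
reduced = apply_pca(vectors, mean, components)
print(vectors.shape, "->", reduced.shape)
```

Query embeddings must be projected with the same `mean` and `components` before searching, and the index must be built with the reduced dimension. Going from 384 to 64 dimensions cuts the raw vector storage by a factor of six.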
Summary
This workflow provides a functional, reproducible local RAG system using small embeddings, a minimal index, and a compact LLM. It is suitable for constrained hardware and supports transparent evaluation and iterative improvement.