RAG Lab: Local LLM + Vector Search
A minimal, end-to-end Retrieval-Augmented Generation stack deployed via Argo CD: Qdrant stores embeddings, an indexer CronJob ingests documents from a PVC, and a FastAPI RAG service answers questions over HTTP via ingress-nginx, using Ollama (Qwen for generation, nomic-embed-text for embeddings).
Overview
This project is a deliberately small, end-to-end Index → Retrieve → Generate RAG system built to learn the real operational shape of "LLM apps" on Kubernetes — networking, storage topology, and observability — without hiding complexity behind managed platforms.
Documents are ingested on a schedule, chunked and embedded using Ollama, stored in Qdrant, and served through a lightweight FastAPI RAG API. The API is exposed via ingress-nginx and reachable by internal DNS, so it behaves like a real service in the homelab.
A key lesson: the cluster uses node-local persistence (local-path PVC), so the RAG workload is designed around storage constraints (PVC binding + node pinning). This makes failures obvious and debugging practical.

What it does
Index
- A scheduled CronJob ingests documents from a mounted PVC.
- Files are chunked, embedded with nomic-embed-text, and written to a Qdrant collection.
- The indexer is built to be re-runnable and observable (logs show chunk counts, embedding calls, and Qdrant writes).
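A minimal sketch of that indexing loop, assuming the docs PVC is mounted at /data/docs, Ollama and Qdrant are reachable at the hostnames shown, and the collection holds 768-dimensional nomic-embed-text vectors. File globs, chunk sizes, and names are illustrative, not the repo's actual values.

```python
# indexer_sketch.py - illustrative indexer loop; paths, hosts, and sizes are assumptions
import hashlib
import pathlib
import requests

OLLAMA_URL = "http://ollama.homelab:11434"   # assumed Ollama endpoint
QDRANT_URL = "http://qdrant:6333"            # assumed in-cluster Qdrant service
COLLECTION = "docs"                          # assumed collection name
DOCS_DIR = pathlib.Path("/data/docs")        # assumed PVC mount path
CHUNK_CHARS = 1000                           # naive fixed-size chunking

def chunk(text: str, size: int = CHUNK_CHARS) -> list[str]:
    """Split a document into fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str) -> list[float]:
    """Embed one chunk with nomic-embed-text via Ollama's embeddings API."""
    r = requests.post(f"{OLLAMA_URL}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    r.raise_for_status()
    return r.json()["embedding"]

def index_all() -> None:
    # Ensure the collection exists (nomic-embed-text produces 768-dim vectors).
    # The call errors if the collection already exists; ignoring the response
    # keeps the job re-runnable.
    requests.put(f"{QDRANT_URL}/collections/{COLLECTION}",
                 json={"vectors": {"size": 768, "distance": "Cosine"}})

    files = sorted(DOCS_DIR.glob("*.txt"))   # assumes plain-text docs
    points = []
    for path in files:
        for i, piece in enumerate(chunk(path.read_text())):
            # Deterministic ID so re-runs overwrite chunks instead of duplicating them.
            pid = int(hashlib.sha1(f"{path.name}:{i}".encode()).hexdigest()[:15], 16)
            points.append({"id": pid,
                           "vector": embed(piece),
                           "payload": {"source": path.name, "chunk": i, "text": piece}})

    r = requests.put(f"{QDRANT_URL}/collections/{COLLECTION}/points",
                     params={"wait": "true"},
                     json={"points": points})
    r.raise_for_status()
    print(f"indexed {len(points)} chunks from {len(files)} files")

if __name__ == "__main__":
    index_all()
```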
Retrieve
- The RAG API embeds the user query and performs vector search in Qdrant.
- A debug endpoint shows the exact retrieved chunks, similarity scores, and retrieval latency to keep the system inspectable.
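A sketch of that retrieval step, assuming the same Qdrant collection and Ollama endpoint as the indexer sketch; the function name and top_k default are illustrative.

```python
# retrieve_sketch.py - illustrative retrieval step; hosts and names are assumptions
import time
import requests

OLLAMA_URL = "http://ollama.homelab:11434"   # assumed Ollama endpoint
QDRANT_URL = "http://qdrant:6333"            # assumed in-cluster Qdrant service
COLLECTION = "docs"                          # assumed collection name

def retrieve(question: str, top_k: int = 4) -> tuple[list[dict], float]:
    """Embed the query, vector-search Qdrant, return hits plus retrieval latency."""
    start = time.perf_counter()
    emb = requests.post(f"{OLLAMA_URL}/api/embeddings",
                        json={"model": "nomic-embed-text", "prompt": question})
    emb.raise_for_status()
    hits = requests.post(f"{QDRANT_URL}/collections/{COLLECTION}/points/search",
                         json={"vector": emb.json()["embedding"],
                               "limit": top_k,
                               "with_payload": True})
    hits.raise_for_status()
    latency_s = time.perf_counter() - start
    # Each hit carries its similarity score and the original chunk payload,
    # which is what the debug endpoint surfaces.
    return hits.json()["result"], latency_s

if __name__ == "__main__":
    chunks, seconds = retrieve("How is the indexer scheduled?")
    for hit in chunks:
        print(f"{hit['score']:.3f}  {hit['payload']['source']}  chunk {hit['payload']['chunk']}")
    print(f"retrieval took {seconds * 1000:.0f} ms")
```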
Generate
- Retrieved context is fed into a local chat model (qwen2.5:7b-instruct) via Ollama.
- The API returns:
  - the answer
  - sources (document IDs + similarity scores)
  - a latency breakdown (retrieval vs. generation)
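A sketch of the generation step and the response shape listed above, reusing the retrieve() idea from the previous snippet; the prompt template and response field names are assumptions, not the service's actual contract.

```python
# generate_sketch.py - illustrative generation step; prompt and field names are assumptions
import time
import requests

OLLAMA_URL = "http://ollama.homelab:11434"   # assumed Ollama endpoint
CHAT_MODEL = "qwen2.5:7b-instruct"

def generate(question: str, hits: list[dict], retrieval_s: float) -> dict:
    """Feed retrieved chunks to the chat model; return answer, sources, and latency."""
    context = "\n\n".join(h["payload"]["text"] for h in hits)
    start = time.perf_counter()
    r = requests.post(f"{OLLAMA_URL}/api/chat",
                      json={"model": CHAT_MODEL,
                            "stream": False,
                            "messages": [
                                {"role": "system",
                                 "content": "Answer using only the provided context."},
                                {"role": "user",
                                 "content": f"Context:\n{context}\n\nQuestion: {question}"},
                            ]})
    r.raise_for_status()
    generation_s = time.perf_counter() - start
    return {
        "answer": r.json()["message"]["content"],
        "sources": [{"id": h["id"], "score": h["score"]} for h in hits],
        "latency": {"retrieval_s": round(retrieval_s, 3),
                    "generation_s": round(generation_s, 3)},
    }
```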
Upload + Reindex
- Documents can be uploaded into the docs PVC via an HTTP endpoint.
- A reindex endpoint enables on-demand rebuilds, while the CronJob provides steady background ingestion.
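A FastAPI sketch of those two endpoints, assuming the docs PVC is mounted at /data/docs inside the API pod and that reindexing simply re-runs the indexing loop in-process; the route names and the index_all() helper are illustrative.

```python
# upload_sketch.py - illustrative upload/reindex endpoints; paths and routes are assumptions
import pathlib
from fastapi import FastAPI, UploadFile

DOCS_DIR = pathlib.Path("/data/docs")   # assumed PVC mount path inside the API pod
app = FastAPI()

@app.post("/upload")
async def upload(file: UploadFile) -> dict:
    """Write an uploaded document into the docs PVC so the next index run picks it up."""
    target = DOCS_DIR / file.filename
    target.write_bytes(await file.read())
    return {"stored": str(target)}

@app.post("/reindex")
def reindex() -> dict:
    """On-demand rebuild: re-run the same indexing loop the CronJob uses."""
    from indexer_sketch import index_all   # hypothetical module from the indexer sketch
    index_all()
    return {"status": "reindexed"}
```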
Expose
- Ingress routes Host: rag.homelab → Service → RAG API pods.
- Pi-hole provides stable internal DNS, so the service behaves like a first-class internal app.
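A quick end-to-end check through the ingress, assuming the API exposes an /ask route at rag.homelab (the route name and request body are assumptions):

```python
# ask_sketch.py - illustrative client call through ingress-nginx; route name is an assumption
import requests

# Pi-hole resolves rag.homelab to the ingress controller, which routes on the Host header.
resp = requests.post("http://rag.homelab/ask",
                     json={"question": "What does the indexer CronJob do?"},
                     timeout=60)
resp.raise_for_status()
body = resp.json()
print(body["answer"])
print(body["sources"])
print(body["latency"])
```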
Demonstrates
- GitOps workflow with Argo CD: desired state in Git, reconciled into the cluster with clear diff/health/sync visibility.
- LLM app plumbing: deterministic wiring from Ingress → Service → Pods → Qdrant → Ollama; no "magic", just traceable requests.
- Storage topology awareness: designed around local-path constraints (RWO volumes, node pinning, avoiding PVC scheduling traps).
- Operational failure modes: discovered and fixed real issues (client/server API mismatches, dependency drift, and external dependencies becoming unavailable, e.g. a sleeping laptop).
- Observability basics: Prometheus-style metrics + latency breakdowns make performance regressions visible and debuggable.
- Runbook discipline: documented prerequisites (like keeping Ollama reachable), verification commands, and failure modes like a production service.
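On the observability point above: a minimal sketch of Prometheus-style latency metrics with prometheus_client, timing retrieval and generation separately. The metric names, bucket boundaries, and /ask handler are assumptions, not the service's actual instrumentation.

```python
# metrics_sketch.py - illustrative latency metrics; metric names and buckets are assumptions
import time
from fastapi import FastAPI
from prometheus_client import Histogram, make_asgi_app

RETRIEVAL_SECONDS = Histogram("rag_retrieval_seconds",
                              "Time spent embedding the query and searching Qdrant",
                              buckets=(0.05, 0.1, 0.25, 0.5, 1, 2, 5))
GENERATION_SECONDS = Histogram("rag_generation_seconds",
                               "Time spent in the Ollama chat call",
                               buckets=(0.5, 1, 2, 5, 10, 30, 60))

app = FastAPI()
app.mount("/metrics", make_asgi_app())   # Prometheus scrapes this path

@app.post("/ask")
def ask(payload: dict) -> dict:
    # Time each stage separately so regressions show up per stage, not just end to end.
    start = time.perf_counter()
    hits: list[dict] = []   # ... vector search goes here ...
    RETRIEVAL_SECONDS.observe(time.perf_counter() - start)

    start = time.perf_counter()
    answer = "..."          # ... Ollama chat call goes here ...
    GENERATION_SECONDS.observe(time.perf_counter() - start)
    return {"answer": answer, "sources": hits}
```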
Notes
This isn't a "perfect" RAG system — it's a controlled lab that makes the operational reality of RAG visible. The goal is to understand exactly what breaks (networking, scheduling, storage, external model dependencies), how to measure it, and how to evolve it into something production-shaped over time.