Homelab Kubernetes Platform (RKE2 + Rancher)

The sovereign Kubernetes platform my AI work runs on — a production-style RKE2 cluster on Proxmox with full GitOps (Argo CD), a high-availability control plane, restore-drilled disaster recovery, Grafana 12 observability, encrypted secrets (SOPS/age), and a self-hosted LLM gateway + local inference layered on top.

Proxmox VELinuxKubernetes (RKE2)RancherArgo CDGrafana 12VeleroPi-holeSOPS/ageMCP ServersAnsible

Overview

This homelab is the sovereign platform my AI work runs on — a production-style Kubernetes cluster on Proxmox (RKE2 + Rancher) that I build, run, and break myself. It is the substrate: a self-hosted LLM gateway, local inference, and retrieval services all sit on top of it, alongside the usual stateful apps.

Everything is managed as code and reconciled through GitOps (Kustomize + Argo CD) — desired state lives in Git, changes are reviewed and reproducible, and drift becomes visible and correctable. Internal services resolve through split-horizon DNS to stable hostnames instead of NodePorts.

What I care about most here is the boring, load-bearing part: a high-availability control plane (multi-node etcd), disaster recovery that is actually restore-drilled rather than just assumed, encrypted secrets, and scoped access. It is, in the end, infrastructure and operations — but it is the kind you can trust an AI platform to live on.

Homepage View

What this demonstrates

  • Practical Kubernetes operations: deployments, services, ingress, PVCs, probes, rollouts
  • DNS + ingress as the stable interface for internal services
  • Storage topology awareness: designing around node-local persistence (local-path, WaitForFirstConsumer)
  • GitOps foundations: Kustomize structure + Argo CD reconciliation, diffs, and controlled sync
  • Three-layer backup strategy (Proxmox + Velero + etcd)
  • Encrypted secrets management with SOPS/age
  • SSH-hardened infrastructure with scoped access controls
  • Grafana 12 observability with Prometheus metrics
  • MCP server integration for AI-assisted cluster operations
  • Substrate for AI workloads: a self-hosted LLM gateway (model routing, SSO/RBAC, cost attribution), local inference, and a retrieval (RAG) service running on the same platform
  • Security hardening by default: OIDC single sign-on (Keycloak), default-deny access, key-only SSH, and SOPS/age-encrypted secrets

Operational Practices

Reliability & recovery

  • Proxmox snapshots before risky changes
  • Velero + Kopia for Kubernetes workload backup and restore
  • etcd snapshots synced to NAS on schedule
  • Documented rollback procedures

Observability

  • Grafana 12 dashboards for cluster health and resource usage
  • Prometheus metrics across all workloads
  • Custom cost attribution dashboards

Security

  • Key-only SSH authentication across all nodes
  • Scoped kubeconfigs for different access levels
  • Secrets encrypted at rest
  • MCP servers scoped to read-only at every layer
  • OIDC single sign-on (Keycloak) with role-mapped access to platform services

Share this project

Share: