Homelab Kubernetes Platform (RKE2 + Rancher)

A 4-node Proxmox cluster running RKE2 Kubernetes with GitOps (Argo CD), a three-layer backup system (Proxmox + Velero + etcd), Grafana 12 observability, encrypted secrets (SOPS/age), MCP server integration for AI-assisted ops, and SSH-hardened infrastructure, operated as a production-grade learning environment.

Proxmox VE · Linux · Kubernetes (RKE2) · Rancher · Argo CD · Grafana 12 · Velero · Pi-hole · SOPS/age · MCP Servers · Ansible

Overview

I run a production-style homelab Kubernetes platform on Proxmox using RKE2, with Rancher as the primary operator interface. Internal services are exposed via ingress-nginx and resolved through Pi-hole (*.homelab) so apps have stable hostnames instead of NodePorts.
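As a rough illustration of the hostname-based pattern, the sketch below builds a standard networking.k8s.io/v1 Ingress that routes <app>.homelab to a Service. The function name, the example app ("grafana"), its namespace and port, and the "nginx" ingress class name are assumptions made for the example, not values taken from this cluster.

```python
import json


def build_ingress(app: str, namespace: str, port: int, domain: str = "homelab") -> dict:
    """Return an Ingress manifest that routes <app>.<domain> to the Service named <app>."""
    host = f"{app}.{domain}"
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": app, "namespace": namespace},
        "spec": {
            # Assumption: the ingress-nginx controller is registered under the class "nginx".
            "ingressClassName": "nginx",
            "rules": [
                {
                    "host": host,
                    "http": {
                        "paths": [
                            {
                                "path": "/",
                                "pathType": "Prefix",
                                "backend": {
                                    "service": {"name": app, "port": {"number": port}}
                                },
                            }
                        ]
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical app, namespace, and port; kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(build_ingress("grafana", "monitoring", 3000), indent=2))
```

With a wildcard *.homelab record in Pi-hole pointing at the ingress controller, each new app would only need its own Ingress rule rather than a dedicated NodePort.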

Workloads and infrastructure are managed as code with Kustomize, and I’ve introduced Argo CD to move toward a true GitOps workflow: desired state lives in Git, changes are reviewed and reproducible, and cluster drift becomes visible and correctable.
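To make the "desired state lives in Git" idea concrete, here is a minimal sketch that generates an Argo CD Application pointing at a Kustomize directory. The repository URL, path, and application name are hypothetical placeholders. Sync is left manual by default, which is one way to get a controlled sync; automated sync with pruning and self-healing is shown as an opt-in.

```python
import json


def build_application(name: str, repo_url: str, path: str, namespace: str,
                      auto_sync: bool = False) -> dict:
    """Return an Argo CD Application manifest for a Kustomize directory in Git."""
    app = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumption: Argo CD is installed in the conventional "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "path": path, "targetRevision": "HEAD"},
            "destination": {
                # In-cluster API server address used when Argo CD manages its own cluster.
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
        },
    }
    if auto_sync:
        # Opt-in: let Argo CD apply changes, prune removed resources, and revert drift.
        app["spec"]["syncPolicy"] = {"automated": {"prune": True, "selfHeal": True}}
    return app


if __name__ == "__main__":
    # Hypothetical repository and path, for illustration only.
    manifest = build_application(
        name="mlops-lab",
        repo_url="https://git.example.com/homelab/platform.git",
        path="apps/mlops-lab/overlays/homelab",
        namespace="mlops-lab",
    )
    print(json.dumps(manifest, indent=2))
```

Without a syncPolicy, Argo CD reports the application as OutOfSync when the cluster and Git diverge but waits for an explicit sync, so drift stays visible without being corrected automatically.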

Stateful services use PVC-backed persistence, with storage constraints (node-local affinity under local-path) explicitly documented. I validate platform patterns with small, real deployments (including a Train → Store → Serve “MLOps lab” workload) and capture runbooks and troubleshooting notes in an internal wiki.
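The node-local constraint can be sketched as a StorageClass and a claim against it. With volumeBindingMode set to WaitForFirstConsumer, the volume is provisioned only once a pod using the claim has been scheduled, so the data lands on that pod's node and the pod is tied to that node afterwards. The claim name, namespace, and size below are hypothetical, and the provisioner string assumes the Rancher local-path-provisioner.

```python
import json

# Assumption: the Rancher local-path-provisioner, which registers under this name.
LOCAL_PATH_PROVISIONER = "rancher.io/local-path"


def build_storage_class(name: str = "local-path") -> dict:
    """Return a StorageClass that delays volume binding until a pod is scheduled."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": LOCAL_PATH_PROVISIONER,
        "volumeBindingMode": "WaitForFirstConsumer",
        "reclaimPolicy": "Delete",
    }


def build_pvc(name: str, namespace: str, size: str,
              storage_class: str = "local-path") -> dict:
    """Return a ReadWriteOnce PersistentVolumeClaim bound to the given StorageClass."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Node-local volumes can only be mounted from the node that holds them.
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Hypothetical claim for the model store in the MLOps lab workload.
    for manifest in (build_storage_class(), build_pvc("model-store", "mlops-lab", "5Gi")):
        print(json.dumps(manifest, indent=2))
```

A practical consequence is that a claim stays Pending until a pod references it, and losing the node means losing the volume, which is part of why workload-level backups matter here.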

Homepage View

What this demonstrates

  • Practical Kubernetes operations: deployments, services, ingress, PVCs, probes, rollouts
  • DNS + ingress as the stable interface for internal services
  • Storage topology awareness: designing around node-local persistence (local-path, WaitForFirstConsumer)
  • GitOps foundations: Kustomize structure + Argo CD reconciliation, diffs, and controlled sync
  • Three-layer backup strategy (Proxmox + Velero + etcd)
  • Encrypted secrets management with SOPS/age
  • SSH-hardened infrastructure with scoped access controls
  • Grafana 12 observability with Prometheus metrics
  • MCP server integration for AI-assisted cluster operations

Operational Practices

Reliability & recovery

  • Proxmox snapshots before risky changes
  • Velero + Kopia for Kubernetes workload backup and restore
  • Scheduled etcd snapshots synced to a NAS
  • Documented rollback procedures

Observability

  • Grafana 12 dashboards for cluster health and resource usage
  • Prometheus metrics across all workloads
  • Custom cost attribution dashboards

Security

  • Key-only SSH authentication across all nodes
  • Scoped kubeconfigs for different access levels
  • Secrets encrypted at rest with SOPS/age
  • MCP servers scoped to read-only at every layer
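At the Kubernetes layer, read-only scoping could look like the ClusterRole sketched below, together with a small check that rejects any write verb. The role name and the resource list are assumptions chosen for illustration. Secrets are deliberately left out, since read access to them would expose credentials.

```python
import json

READ_ONLY_VERBS = {"get", "list", "watch"}


def build_read_only_role(name: str = "mcp-read-only") -> dict:
    """Return a ClusterRole that grants only get/list/watch on common resources."""
    verbs = sorted(READ_ONLY_VERBS)
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {
                "apiGroups": [""],  # core API group; Secrets intentionally omitted
                "resources": ["pods", "pods/log", "services", "events", "nodes",
                              "namespaces", "persistentvolumeclaims"],
                "verbs": verbs,
            },
            {
                "apiGroups": ["apps"],
                "resources": ["deployments", "statefulsets", "daemonsets", "replicasets"],
                "verbs": verbs,
            },
        ],
    }


def is_read_only(role: dict) -> bool:
    """Return True if every rule in the role uses only get, list, or watch."""
    return all(set(rule["verbs"]) <= READ_ONLY_VERBS for rule in role["rules"])


if __name__ == "__main__":
    role = build_read_only_role()
    # Guard against accidentally widening the role before it is applied.
    assert is_read_only(role)
    print(json.dumps(role, indent=2))
```

Binding a dedicated ServiceAccount to a role like this, and issuing the scoped kubeconfig from that account, keeps the permission boundary enforced by the API server rather than by the tool's own configuration.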
