Migrate to New Kubernetes Cluster
Published: 2026-04-19 | Section: Cloud Engineering | Author: Amirreza Rezaie
From K3s to production-grade infrastructure: a complete migration journey with Cilium, Longhorn, and Vault.
Introduction
Many months ago, when I first created Goalixa’s base services, I used a K3s cluster for infrastructure. At that time, my goal was simply to get a monolithic service up and running and expose it on the internet. I didn’t care much about what happened at the infrastructure level — I just needed something that worked.
But now everything has changed. Goalixa has evolved into a system with multiple critical components and microservices, capable of serving real production traffic. As an SRE, I know that reliability is paramount in this business. I need infrastructure that is manageable, scalable, and highly configurable.
This post documents the complete journey of migrating from a simple K3s setup to a production-grade Kubernetes cluster — covering networking, storage, secrets management, and the lessons learned along the way.
The Problems
Why K3s Was No Longer Enough
While K3s served its purpose in the early days, it introduced several critical issues as the system grew:
Storage Problems: In K3s, I used node-local storage for persistent volumes. When I drained and removed a node from the cluster, every volume on that node went with it. One night I drained a node, and when I checked Grafana the next morning, everything was gone: every dashboard and every metric had disappeared. This was a wake-up call.
Network Limitations: The default networking in K3s wasn’t sufficient for the level of observability and control I needed in production. I required advanced networking features, better security, and more granular control over traffic.
Scalability Concerns: As the number of microservices grew, I needed a more robust infrastructure that could scale horizontally and handle failures gracefully without data loss.
Critical Concerns
The most important concern during this migration was data integrity. I couldn’t afford to lose any data during the transition. Every database, every persistent volume, every piece of configuration needed to be carefully migrated.
Additionally, I had accumulated secrets over time — API keys, database credentials, tokens, and certificates. Managing these securely during and after the migration was paramount.
Critical Issues Encountered
Before the migration was complete, I encountered several critical issues that highlighted why this work was necessary:
Grafana Not Available
One of the immediate problems was Grafana becoming unavailable. After a node drain, the metrics and monitoring infrastructure simply stopped working. This wasn’t acceptable for production — I needed persistent storage that could survive node failures.
BFF Service Image Pull Issue
When draining nodes, the BFF (Backend for Frontend) service experienced image pull failures. This caused complete business downtime. The root cause was related to how the cluster handled storage and pod scheduling during node maintenance.
These incidents confirmed that local storage and a basic Kubernetes setup wouldn’t cut it for production reliability.
Network Level
Cilium for CNI
The first major decision was to replace the default CNI with Cilium. Cilium provides:
- Advanced Networking: eBPF-based networking for high performance
- Security: Identity-based network policies, with visibility into which workloads talk to each other
- Observability: Built-in Hubble for network traffic analysis
- Scalability: Handles thousands of services and endpoints efficiently
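To make the identity-based policy model concrete, here is a minimal sketch that builds a CiliumNetworkPolicy manifest allowing ingress to one workload only from pods with a given label. The policy name, the `goalixa` namespace, the `app` labels, and port 8080 are assumptions for illustration, not values taken from the actual cluster.

```python
import json


def cilium_policy(name, namespace, target_labels, allowed_labels, port):
    """Build a CiliumNetworkPolicy that admits TCP ingress on one port,
    and only from pods carrying allowed_labels."""
    return {
        "apiVersion": "cilium.io/v2",
        "kind": "CiliumNetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pods this policy applies to.
            "endpointSelector": {"matchLabels": target_labels},
            "ingress": [
                {
                    # Pods allowed to connect, selected by label identity.
                    "fromEndpoints": [{"matchLabels": allowed_labels}],
                    "toPorts": [
                        {"ports": [{"port": str(port), "protocol": "TCP"}]}
                    ],
                }
            ],
        },
    }


# Hypothetical names: only the BFF may reach Core-API on port 8080.
policy = cilium_policy(
    name="core-api-from-bff",
    namespace="goalixa",
    target_labels={"app": "core-api"},
    allowed_labels={"app": "bff"},
    port=8080,
)
print(json.dumps(policy, indent=2))
```

Because JSON is valid YAML, the printed manifest could be saved to a file and applied with `kubectl apply -f`.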
Hubble Integration
Hubble provides deep observability into cluster networking. I configured Hubble to:
- Monitor all network traffic between services
- Identify latency bottlenecks
- Detect anomalies in communication patterns
- Provide detailed flow logs for troubleshooting
After setting up Cilium and Hubble, I gained visibility into how services communicated, which was invaluable for debugging and optimization.
NGINX Ingress Controller
With Cilium properly configured, I set up the NGINX Ingress Controller to manage external traffic entering the cluster. This provides:
- Load Balancing: Distribute traffic across service replicas
- TLS Termination: Handle SSL/TLS certificates
- Path-based Routing: Route requests to appropriate services
- Rate Limiting: Protect services from traffic spikes
The combination of Cilium for pod-to-pod communication and NGINX Ingress for external traffic provides a complete networking solution.
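As a rough illustration of how TLS termination, path-based routing, and rate limiting fit into a single Ingress resource, the sketch below generates one manifest for the community NGINX Ingress Controller. The host name, TLS secret name, service names, ports, and the rate limit are all hypothetical placeholders.

```python
import json


def ingress_manifest(host, tls_secret, routes, limit_rps):
    """Build an Ingress for the community NGINX Ingress Controller.

    routes maps a URL path prefix to a (service_name, service_port) pair.
    """
    paths = [
        {
            "path": prefix,
            "pathType": "Prefix",
            "backend": {"service": {"name": name, "port": {"number": port}}},
        }
        for prefix, (name, port) in routes.items()
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {
            "name": "goalixa-ingress",  # hypothetical name
            "annotations": {
                # Requests per second accepted from a single client IP.
                "nginx.ingress.kubernetes.io/limit-rps": str(limit_rps),
            },
        },
        "spec": {
            "ingressClassName": "nginx",
            # The controller terminates TLS using this secret.
            "tls": [{"hosts": [host], "secretName": tls_secret}],
            "rules": [{"host": host, "http": {"paths": paths}}],
        },
    }


# Hypothetical routing: /api goes to Core-API, everything else to the BFF.
ingress = ingress_manifest(
    host="app.example.com",
    tls_secret="app-example-com-tls",
    routes={"/api": ("core-api", 8080), "/": ("bff", 3000)},
    limit_rps=20,
)
print(json.dumps(ingress, indent=2))
```

The Service objects behind the Ingress spread requests across replicas, so load balancing needs no extra configuration in this sketch.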
Storage Level
Secret Management with Vault
HashiCorp Vault became the cornerstone of my secrets management strategy. The implementation covered:
- Centralized Secrets: All secrets stored in one place
- Dynamic Secrets: Generate credentials on-demand for databases
- Encryption at Rest: All secrets encrypted when stored
- Access Control: Fine-grained policies for different teams and services
Vault Setup in the New Cluster
I configured Vault using the official Helm chart with:
- High Availability: Running multiple Vault pods for redundancy
- Unseal Strategy: Using Kubernetes service-based auto-unseal
- Storage Backend: Using Longhorn for persistent Vault data
After setup, I migrated all existing secrets from the old cluster to Vault in the new cluster.
Longhorn for Persistent Storage
After evaluating options, I chose Longhorn for storage management:
- Distributed Block Storage: Replicates data across multiple nodes
- Data Safety: Replicas survive node failures
- Easy Management: Simple UI for volume management
- Backup Support: Built-in backup to external storage
Longhorn solved the critical issue of data persistence. With replicas spread across nodes, draining a node no longer means data loss.
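The difference between node-local and replicated volumes can be shown with a toy model. This is not Longhorn's scheduler, only a sketch of the reasoning: a volume's data is lost when every node holding a copy of it is removed. Volume and node names are made up for the example.

```python
def volumes_lost(placement, removed_nodes):
    """Return the volumes whose every replica lived on a removed node.

    placement maps a volume name to the list of nodes holding a copy.
    """
    removed = set(removed_nodes)
    return sorted(
        volume
        for volume, nodes in placement.items()
        if set(nodes) <= removed
    )


# Old setup: each volume exists on exactly one node.
local_storage = {
    "grafana-data": ["node-1"],
    "postgres-data": ["node-2"],
}

# New setup: each volume has a replica on three different nodes.
replicated = {
    "grafana-data": ["node-1", "node-2", "node-3"],
    "postgres-data": ["node-1", "node-2", "node-3"],
}

print(volumes_lost(local_storage, ["node-1"]))  # ['grafana-data']
print(volumes_lost(replicated, ["node-1"]))     # []
```

With three replicas, data is lost in this model only if all three nodes disappear before any replica is rebuilt elsewhere.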
Database Migration
I migrated all databases and other stateful data to Longhorn-backed volumes in the new cluster:
- PostgreSQL databases for Core-API
- Redis data for BFF caching
- Metrics data for Prometheus and Grafana
Each database got dedicated persistent volumes with appropriate replica counts.
Ceph Considerations
Initially, I considered Ceph as the storage backend. However, after evaluating Longhorn’s capabilities, I found it better suited for my use case. Longhorn provides:
- Simpler deployment and management
- Adequate performance for my workloads
- Excellent integration with Kubernetes
- Built-in backup functionality
CI/CD
ArgoCD for GitOps
ArgoCD was already set up in the new cluster, providing:
- Declarative Deployments: All configurations stored in Git
- Automated Sync: Changes automatically deployed to the cluster
- Health Monitoring: Visual feedback on application status
- Rollback Capability: Easy reversion to previous versions
The challenge was integrating Vault with ArgoCD for secrets management. I needed to:
- Configure ArgoCD to authenticate with Vault
- Update applications to read secrets from Vault
- Ensure seamless secret injection during deployments
Harbor for Container Registry
Harbor remains a critical component for storing and managing container images. The migration plan includes setting up Harbor in the new cluster, though this is scheduled for the final phase of migration.
Migration Process
Phase 1: Infrastructure Setup (Completed)
- Configure Cilium CNI
- Set up Hubble for observability
- Install NGINX Ingress Controller
- Deploy Longhorn for persistent storage
- Configure Vault for secrets management
- Set up ArgoCD
Phase 2: Database Migration (Completed)
- Create backups from old cluster
- Migrate databases to new cluster with Longhorn storage
- Verify data integrity after migration
- Test database connectivity from applications
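One way to approach the integrity check is to compare a row count and an order-independent fingerprint of each table on both clusters. The sketch below assumes rows have already been fetched as tuples from the old and new databases; how they are fetched, and the sample rows themselves, are hypothetical.

```python
import hashlib


def table_fingerprint(rows):
    """Return (row_count, digest) for an iterable of row tuples.

    Each row is hashed on its own and the row hashes are sorted before
    being combined, so the result does not depend on row order.
    """
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_hashes).encode("ascii")).hexdigest()
    return len(row_hashes), combined


def tables_match(old_rows, new_rows):
    """True when both sides hold the same rows, in any order."""
    return table_fingerprint(old_rows) == table_fingerprint(new_rows)


# Hypothetical sample data standing in for query results.
old_cluster_rows = [(1, "alice", "active"), (2, "bob", "paused")]
new_cluster_rows = [(2, "bob", "paused"), (1, "alice", "active")]
truncated_rows = [(1, "alice", "active")]

print(tables_match(old_cluster_rows, new_cluster_rows))  # True
print(tables_match(old_cluster_rows, truncated_rows))    # False
```

Because the fingerprint is built from `repr`, both sides must return values with the same Python types; a column read as a string on one side and as an integer on the other would be reported as a mismatch.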
Phase 3: Application Migration (In Progress)
- Deploy applications using ArgoCD
- Reduce load from old cluster
- Switch traffic to new cluster
- Verify all services are functioning
Phase 4: Cluster Transition (Planned)
- Remove nodes from old K3s cluster
- Clean up old nodes
- Join new nodes to the production cluster
- Scale down old cluster completely
Phase 5: Harbor Migration (Planned)
- Set up Harbor in new cluster
- Migrate container images
- Update image references in ArgoCD
Secrets Management Deep Dive
Learning Vault
Working with Vault involved a steep learning curve. The key concepts I had to master:
Secrets Engines: Different backends for different types of secrets (KV, Database, AWS, etc.)
Authentication Methods: Various ways to authenticate with Vault (Kubernetes, GitHub, userpass, etc.)
Policies: Defining who can access what secrets
Transit Secrets Engine: Encryption as a service for application data
Integration with Applications
The pattern I’m implementing:
- Applications authenticate to Vault with their Kubernetes ServiceAccount token
- Vault validates that token against the Kubernetes API
- Applications read secrets via Vault API or sidecar injector
- Secrets are automatically rotated when needed
This provides a clean separation between application code and secrets management.
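For the direct API route, the flow can be sketched with Vault's HTTP API and the Python standard library. The sketch assumes the Kubernetes auth method is mounted at `kubernetes` and a KV version 2 engine at `secret`; the Vault address, role name, and secret path in the usage note are hypothetical.

```python
import json
import urllib.request

# Standard location of the ServiceAccount token inside a pod.
SA_TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def login_request(vault_addr, role, jwt, mount="kubernetes"):
    """Build the Kubernetes auth login request (role plus ServiceAccount JWT)."""
    body = json.dumps({"role": role, "jwt": jwt}).encode("utf-8")
    return urllib.request.Request(
        f"{vault_addr}/v1/auth/{mount}/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def kv2_read_request(vault_addr, token, path, mount="secret"):
    """Build a read request for a KV version 2 secret."""
    return urllib.request.Request(
        f"{vault_addr}/v1/{mount}/data/{path}",
        headers={"X-Vault-Token": token},
        method="GET",
    )


def send(request):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def fetch_secret(vault_addr, role, path):
    """Log in with the pod's ServiceAccount token, then read one secret."""
    with open(SA_TOKEN_PATH) as token_file:
        jwt = token_file.read().strip()
    token = send(login_request(vault_addr, role, jwt))["auth"]["client_token"]
    # KV v2 wraps the secret's key/value pairs under data.data.
    return send(kv2_read_request(vault_addr, token, path))["data"]["data"]
```

Inside a pod whose ServiceAccount is bound to a Vault role, a call such as `fetch_secret("http://vault.vault.svc:8200", "bff", "bff/config")` would return the secret's key/value pairs as a dictionary. The sidecar injector performs the same login on the application's behalf, so the application code needs no Vault client at all.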
Lessons Learned
1. Storage is Fundamental
Local storage was the root cause of most data loss incidents. Moving to distributed storage (Longhorn) was essential for production reliability.
2. Plan for Failure
Every component should be designed to handle failure gracefully. Replication, backups, and automatic failover are not optional in production.
3. Observability First
Before making any infrastructure changes, ensure you can see what’s happening. Hubble, Prometheus, and Grafana are essential tools.
4. Secrets Are Critical
Invest time in proper secrets management from the start. Retrofitting Vault is harder than building it in from the beginning.
5. Migration Requires Patience
Rushing the migration risks data loss. Each phase needs thorough testing before proceeding to the next.
What’s Next
The migration is ongoing. Current priorities:
- Complete application migration to new cluster
- Configure Vault integration with all services
- Set up Harbor in the new cluster
- Migrate remaining workloads
- Decommission old K3s cluster
This journey has transformed Goalixa’s infrastructure from a simple cluster to a production-ready system capable of serving real traffic with confidence.