How We Reduced Failed Deployments by 99.4% and Cut Rollback Time to 4s with Pre-warmed Canaries and eBPF SLO Enforcement
Current Situation Analysis
In Q3 2024, we managed 412 microservices across three K8s 1.31 clusters handling 140k RPS at peak. Our standard deployment strategy was a RollingUpdate with maxSurge: 25% and maxUnavailable: 25%. On paper, this is safe. In production, it was a latency bomb.
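
To make those RollingUpdate settings concrete, here is a minimal sketch of how the two percentages translate into pod counts during a rollout. It assumes the documented Kubernetes rounding behavior (maxSurge rounds up, maxUnavailable rounds down). The function name `rolling_update_bounds` and the replica counts are illustrative assumptions, not values from our clusters.

```python
def rolling_update_bounds(replicas: int,
                          max_surge_pct: int = 25,
                          max_unavailable_pct: int = 25) -> tuple[int, int]:
    """Return (max total pods, min available pods) allowed during a rollout.

    Hypothetical helper for illustration; it mirrors the rounding rules
    Kubernetes documents for percentage values in a Deployment strategy.
    """
    # maxSurge is rounded up (integer ceiling division).
    surge = -(-replicas * max_surge_pct // 100)
    # maxUnavailable is rounded down (integer floor division).
    unavailable = replicas * max_unavailable_pct // 100
    # If both resolve to zero the rollout could never progress, so the
    # controller treats maxUnavailable as 1 in that case.
    if surge == 0 and unavailable == 0:
        unavailable = 1
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    # Illustrative replica counts only.
    for replicas in (4, 10, 40):
        max_total, min_available = rolling_update_bounds(replicas)
        print(f"replicas={replicas}: up to {max_total} pods, "
              f"at least {min_available} available")
```

For a 10-replica Deployment this allows as few as 8 available pods mid-rollout, so up to a fifth of steady-state capacity can be missing at any moment while the replacement pods come up.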
