Week #29: Infra, Simplified: Weekly Wrap
This week, the team focused heavily on enhancing our Kubernetes environments, streamlining observability with Grafana and Loki, and fine-tuning scaling automation to ensure peak efficiency and reliability. We've prioritized security, cost efficiency, and clarity in monitoring—paving the way for a more stable and scalable infrastructure.
Unified Kubernetes Version (EKS 1.31): Upgraded multiple clusters to Kubernetes 1.31 on Amazon EKS. Regularly updating our Kubernetes version helps us leverage enhanced security features, better resource management, and continued AWS support. It simplifies cluster management and ensures compatibility with the latest cloud-native tools.
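As a sketch of how such upgrades can be kept repeatable, eksctl supports driving the cluster from a declarative ClusterConfig file (the cluster name and region below are placeholders, not our actual values):

```yaml
# Illustrative eksctl ClusterConfig; name and region are hypothetical.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster   # placeholder cluster name
  region: us-east-1       # assumed region
  version: "1.31"         # target control-plane version
```

With a file like this, something along the lines of `eksctl upgrade cluster -f cluster.yaml --approve` moves the control plane up one minor version at a time; node groups are upgraded in a separate step.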
Migration to Grafana & Loki Underway: Transitioned key workloads from CloudWatch to Grafana and Loki. Grafana offers superior flexibility for dashboards and real-time metrics, making troubleshooting faster, while Loki cuts cost and complexity compared to traditional log management tools by indexing only label metadata rather than full log content, giving us fast searches and efficient storage.
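To illustrate why Loki's metadata-only indexing still gives fast searches: a LogQL query in Grafana filters by indexed labels first and only then scans log lines. The `env` and `app` labels below are an assumed labeling scheme, not our actual one:

```logql
# Rate of error-containing lines per service over 5-minute windows
# (labels and values are illustrative)
sum by (app) (rate({env="staging", app=~"api|worker"} |= "error" [5m]))
```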
Event Exporter for Kubernetes: Integrated Kubernetes Event Exporter across clusters, capturing important but transient events that were previously easy to miss. This gives us real-time awareness of cluster activities and significantly improves debugging and incident response.
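Assuming the commonly used kubernetes-event-exporter, a minimal config might route every event to stdout as JSON so the log pipeline picks it up (receiver names here are illustrative, not our actual setup):

```yaml
logLevel: error
logFormat: json
route:
  routes:
    - match:
        - receiver: "dump"   # send every cluster event to the stdout receiver
receivers:
  - name: "dump"
    stdout: {}               # emitted as JSON lines; scraped by the log agent
```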
Reduced Monitoring Noise: Optimized alert configurations, cutting down noisy alerts by roughly 30%, ensuring our on-call team only sees meaningful, actionable alerts.
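One common noise-reduction pattern is requiring a breach to be sustained before an alert fires. A Prometheus-style rule sketch (metric names and thresholds are illustrative, not our production rules):

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        # Fire only if the 5xx ratio stays above 5% for 10 minutes,
        # suppressing short transient blips that previously paged on-call.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Sustained 5xx error rate above 5%"
```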
Enhanced Autoscaling with KEDA: Improved worker queue scaling logic using Kubernetes-based Event-Driven Autoscaling (KEDA). KEDA efficiently adjusts pod numbers based on queue length or workload metrics, helping us smoothly handle workload spikes without manual intervention—resulting in reduced operational costs and improved performance.
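A KEDA ScaledObject along these lines scales a worker Deployment on queue depth. The RabbitMQ trigger and all names below are assumptions for illustration; the post does not name our queue backend:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler        # hypothetical name
spec:
  scaleTargetRef:
    name: queue-worker             # Deployment to scale (illustrative)
  minReplicaCount: 1
  maxReplicaCount: 20
  cooldownPeriod: 120              # seconds to wait before scaling back down
  triggers:
    - type: rabbitmq               # assumed queue backend
      metadata:
        queueName: background-tasks
        mode: QueueLength          # scale on number of waiting messages
        value: "50"                # target messages per replica
        hostFromEnv: RABBITMQ_HOST # connection string read from the pod env
```

KEDA then adds or removes replicas so each one handles roughly the target queue length, which is how spikes are absorbed without manual intervention.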
Queue Workers Optimized: Adjusted scaling thresholds on key background workers, significantly improving queue processing times and ensuring critical tasks complete quickly.
Resolved CI/CD Bottlenecks: Addressed CI/CD pipeline issues causing build stalls and deployment failures. Streamlined pipeline performance means less waiting for developers, quicker feature releases, and smoother deployments.
CORS & Connectivity Issues Fixed: Patched frontend-to-backend communication issues in staging environments, improving development velocity and frontend team efficiency.
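If staging fronts the backend with ingress-nginx, CORS can be handled once at the ingress instead of per service. This is a sketch under that assumption; hosts, origins, and service names are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-staging                # hypothetical
  annotations:
    nginx.ingress.kubernetes.io/enable-cors: "true"
    nginx.ingress.kubernetes.io/cors-allow-origin: "https://staging.example.com"
    nginx.ingress.kubernetes.io/cors-allow-methods: "GET, POST, PUT, DELETE, OPTIONS"
    nginx.ingress.kubernetes.io/cors-allow-headers: "Authorization, Content-Type"
spec:
  ingressClassName: nginx
  rules:
    - host: api.staging.example.com   # placeholder host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api             # hypothetical backend service
                port:
                  number: 80
```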
Logging and Tracing Improved: Standardized trace IDs and logs across services, making distributed tracing and debugging more straightforward and effective.
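As a sketch of what standardized trace IDs look like in practice, a logging filter can stamp every record with a shared ID so logs from different services correlate. Names like `TraceIdFilter` and `build_logger` are ours for illustration, not from a specific library:

```python
import logging
import uuid


class TraceIdFilter(logging.Filter):
    """Attach a trace_id attribute to every log record (hypothetical helper)."""

    def __init__(self, trace_id=None):
        super().__init__()
        # In a real service this would come from an incoming request header,
        # e.g. propagated by a tracing system; here we just generate one.
        self.trace_id = trace_id or uuid.uuid4().hex

    def filter(self, record):
        record.trace_id = self.trace_id
        return True


def build_logger(name, trace_id=None):
    """Return a logger whose format always includes trace_id."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"
    ))
    logger.addHandler(handler)
    logger.addFilter(TraceIdFilter(trace_id))
    logger.setLevel(logging.INFO)
    return logger
```

Because every service formats the same `trace_id` field, a single Loki query over that field reconstructs a request's path across services.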
Resource Requests Configured: Explicitly defined CPU/memory resource requests for several pods to prevent unexpected restarts and ensure predictable resource allocation.
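In a pod spec this looks like the following; the values are illustrative, not our production numbers:

```yaml
containers:
  - name: worker                    # hypothetical container
    image: example/worker:latest    # placeholder image
    resources:
      requests:
        cpu: "250m"     # scheduler reserves a quarter core for this pod
        memory: "256Mi" # reserved memory; also raises eviction priority
      limits:
        memory: "512Mi" # hard cap; exceeding it OOM-kills the container
```

Pods without requests land in the BestEffort QoS class and are evicted first under node pressure, which is one source of the unexpected restarts this change prevents.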
Investigating Elevated 5xx Errors: We observed an uptick in 5xx errors on certain critical API endpoints. Initial investigation suggests a link to recent autoscaling adjustments; our engineers are actively working to pinpoint and resolve the issue.
Storage Class Optimization (gp2 Volumes): Migrating the default StorageClass behind our PersistentVolumeClaims away from general-purpose gp2 volumes to an option with more predictable performance, improving both reliability and cost-effectiveness.
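Assuming the target is gp3 via the EBS CSI driver (our assumption; the post does not name the new class), a replacement default StorageClass could look like:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: ebs.csi.aws.com
parameters:
  # gp3 decouples IOPS/throughput from volume size, giving a predictable
  # 3000 IOPS / 125 MiB/s baseline regardless of capacity.
  type: gp3
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

Only new PersistentVolumes pick up the class; existing gp2 volumes would need a separate migration.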
Granular Cost Allocation: Detailed review of AWS spend across microservices to ensure cost transparency and identify optimization opportunities.
Monitoring Refinements: Addressing remaining Loki logging issues and further enhancing Grafana dashboards for even clearer insights.
Observability Team: Big kudos for significantly reducing unnecessary alerts and improving the signal-to-noise ratio, resulting in calmer, more focused on-call rotations.
Slack Wisdom: Special mention to the spontaneous Slack deep-dive into KEDA autoscaling—team-driven knowledge sharing at its best!
Lesson Learned: The repeated success of our Kubernetes upgrades underscores the huge payoff from automated, repeatable processes. Efficiency and automation for the win!