Case Study
Real-world traffic is not what you test for: managing 15x more traffic than expected, and the lessons learned.
Buycycle.com is a marketplace that sells more than 15,000 high-quality pre-owned bikes. They launched a TV advertising campaign that resulted in an unexpected 15-fold increase in website traffic. The surge exposed deficiencies in the infrastructure setup, particularly in the autoscaling configuration. The infrastructure initially used a mix of Amazon EC2 (Elastic Compute Cloud) and EKS (Elastic Kubernetes Service), but static and dynamic workloads were not distributed correctly between the two.
Challenge
1. The original setup relied on more Kubernetes tooling than it needed, which, while robust, added unnecessary complexity and latency to the autoscaling process.
2. The misallocation between EC2 and EKS for handling static versus dynamic resources worsened the problem, as autoscaling mechanisms were not responding swiftly enough to the sudden increase in demand.
3. Initially, the monitoring system did not provide data about the specific inefficiencies and root causes of the scaling issue.
What We Did
To address these challenges, we did the following:
We simplified the use of Kubernetes, removing unnecessary tools that affected autoscaling.
We implemented a monitoring system with alarms, health checks, logging, and tracing to provide precise insights into system performance and health.
Our infrastructure team worked closely with developers to optimize application performance and ensure that the backend could handle large spikes in traffic without service degradation.
We carefully reallocated the resources, focusing on better static and dynamic workload distribution between EC2 and EKS.
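As an illustration of the alarm part of such a monitoring setup, here is a minimal sketch in Python. It is not the configuration used in this project: the function name `alarm_state`, the metric values, and the threshold are all hypothetical. It shows one common idea, raising an alarm only when a metric stays above a threshold for several consecutive datapoints, so that a single noisy reading does not trigger it.

```python
def alarm_state(samples, threshold, datapoints_to_alarm):
    """Evaluate a simple sustained-breach alarm.

    samples: metric readings, oldest first (for example, requests per second).
    threshold: value a reading must exceed to count as a breach.
    datapoints_to_alarm: how many of the most recent readings must all breach.

    All names and values here are illustrative assumptions.
    """
    if datapoints_to_alarm < 1:
        raise ValueError("datapoints_to_alarm must be at least 1")
    if len(samples) < datapoints_to_alarm:
        # Not enough history yet to make a decision.
        return "INSUFFICIENT_DATA"
    recent = samples[-datapoints_to_alarm:]
    return "ALARM" if all(value > threshold for value in recent) else "OK"


if __name__ == "__main__":
    # Hypothetical per-minute request rates before and during a spike.
    print(alarm_state([120, 900, 950, 990], threshold=800, datapoints_to_alarm=3))
```

Requiring several consecutive breaches trades a little detection delay for fewer false alarms; the right window depends on how quickly the system needs to react.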
Lessons Learned
1. Real-world traffic is more complex than test traffic. Unlike organic growth, advertisement-driven traffic can spike unpredictably. Therefore, it is essential to test infrastructure against simulated spikes in traffic rather than just incremental increases to ensure the system can handle sudden growth.
2. Analyzing and understanding how users interact with the site during high-traffic scenarios is essential to designing more resilient and responsive systems.
3. Testing should consider both server capabilities and the environment from which requests originate. Using a multi-source approach for load testing can help better understand the system's behavior.
4. Traditional autoscaling may not be the most effective strategy for specific workloads or peak traffic conditions.
5. Under overload, the system failed to respond to health checks and I/O operations, which produced misleading system health indicators.
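The difference between an incremental increase and a sudden spike, as described in the first lesson, can be sketched as two load profiles. This is a minimal illustration in Python, not the test plan used in this project: the function names, the baseline of 100 requests per second, and the timings are hypothetical assumptions. Only the 15x multiplier comes from the case study. Each profile is a list of target request rates, one per second, that a load-testing tool could be driven with.

```python
def ramp_profile(baseline_rps, peak_rps, duration_s):
    """Incremental increase: a linear climb from baseline to peak."""
    if duration_s < 2:
        return [peak_rps] * duration_s
    step = (peak_rps - baseline_rps) / (duration_s - 1)
    return [round(baseline_rps + step * t) for t in range(duration_s)]


def spike_profile(baseline_rps, multiplier, duration_s, spike_start_s, spike_length_s):
    """Sudden spike: baseline load, then an immediate jump to baseline * multiplier."""
    peak_rps = baseline_rps * multiplier
    spike_end_s = spike_start_s + spike_length_s
    return [
        peak_rps if spike_start_s <= t < spike_end_s else baseline_rps
        for t in range(duration_s)
    ]


def max_jump(profile):
    """Largest increase in target rate between two consecutive seconds."""
    return max((b - a for a, b in zip(profile, profile[1:])), default=0)


if __name__ == "__main__":
    # Hypothetical numbers: 100 rps baseline, 10-minute test, 15x peak.
    ramp = ramp_profile(baseline_rps=100, peak_rps=1500, duration_s=600)
    spike = spike_profile(baseline_rps=100, multiplier=15, duration_s=600,
                          spike_start_s=120, spike_length_s=180)
    print("ramp max jump per second:", max_jump(ramp))
    print("spike max jump per second:", max_jump(spike))
```

Both profiles reach the same peak, but the ramp gives autoscaling minutes to react, while the spike demands the full capacity within one second. A test suite that only uses the first shape says little about how the system behaves under the second.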
Myasnik Manvelyan
VP of Engineering, Buycycle.com, Munich, Germany
Thanks to Das Meta’s cloud architecture and infrastructure management team’s approach to performance tuning, we now have a high-performing website with a high SLA. Following a TV ad campaign that spiked our site traffic 15 times higher than expected, we quickly realized our infrastructure setup wasn’t correct. Das Meta’s experts implemented a monitoring system and fine-tuned our resources, making sure our website works without downtime, whatever the traffic.
Conclusion
Sound infrastructure management ensures that organizations are prepared to handle unexpected challenges and capitalize on opportunities for growth and expansion.
This case study shows why sound infrastructure management requires deep expertise: understanding and planning for the complexities of real-world scenarios is essential.
If your company faces similar challenges, our team is ready to help transform your cloud infrastructure management.
Reach out to learn more about our services and how we can assist you in achieving similar results.