Stabilizing and Scaling a B2B SaaS Chatbot Powered by Large Language Models
Challenges
1. Single-Container Monolith
• The chatbot’s entire functionality (including Celery workers, queues, and application logic) ran in a single container.
• This setup hindered independent scaling and caused resource contention.
2. Improper Kubernetes Configuration
• Services ran without properly assigned resource requests and limits.
• Pods crashed and restarted unpredictably, destabilizing the platform.
3. Lack of Monitoring and APM
• No Application Performance Monitoring (APM) or structured logging was in place.
• Diagnosing performance bottlenecks or system failures was challenging.
4. Rising Costs and Limited Visibility
• With resource usage untracked and unbounded, unplanned costs accumulated.
• There was no clear strategy for scaling efficiently to meet workload demand.
Solutions
1. Service Separation and Kubernetes Best Practices
• We split the monolithic container into multiple deployments, isolating key services (application servers, Celery workers, queue managers).
• Implemented proper Kubernetes resource requests and limits to ensure predictable performance.
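For illustration, here is a minimal sketch of what one of the split-out worker Deployments might look like with requests and limits in place; the names, image, and resource figures are hypothetical rather than the client’s actual configuration:

```yaml
# Hypothetical Deployment for the Celery worker tier, isolated from the web
# application. The image and resource figures below are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chatbot-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chatbot-worker
  template:
    metadata:
      labels:
        app: chatbot-worker
    spec:
      containers:
        - name: worker
          image: registry.example.com/chatbot:latest  # placeholder image
          command: ["celery", "-A", "app", "worker", "--loglevel=info"]
          resources:
            requests:   # reserved by the scheduler; the basis for bin-packing
              cpu: 500m
              memory: 512Mi
            limits:     # hard ceiling; stops one tier from starving another
              cpu: "1"
              memory: 1Gi
```

With each tier in its own Deployment, Kubernetes can place, restart, and scale the workers independently of the API pods.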
2. Introduction of Monitoring and APM
• Deployed monitoring and APM tooling to track system health in real time.
• Set up alerting to quickly identify and address issues before they affected end users.
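As a sketch of how such alerting might be expressed, here is a Prometheus-style rule that fires on the crash-and-restart pattern described earlier; the metric comes from kube-state-metrics, and the namespace label and thresholds are assumptions, since the write-up does not name the actual monitoring stack:

```yaml
# Illustrative Prometheus rule file. Namespace, threshold, and severity are
# assumptions; the client's actual APM/alerting configuration is not shown.
groups:
  - name: chatbot-health
    rules:
      - alert: PodRestartingFrequently
        expr: increase(kube_pod_container_status_restarts_total{namespace="chatbot"}[15m]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.pod }} restarted more than 3 times in the last 15 minutes"
```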
3. Autoscaling and Colocation
• Enabled task/request-based autoscaling to scale each service independently.
• Where beneficial, colocated certain services to balance resource usage and reduce overhead.
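Task-based scaling of the worker tier could be wired up with something like a KEDA ScaledObject that watches the depth of the Celery queue; KEDA and a Redis broker are assumptions here, as the write-up does not name the exact mechanism used:

```yaml
# Hypothetical KEDA ScaledObject: scales the worker Deployment on the backlog
# in the Celery queue. Address, queue name, and thresholds are illustrative.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: chatbot-worker-scaler
spec:
  scaleTargetRef:
    name: chatbot-worker              # the Deployment from the earlier sketch
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: redis
      metadata:
        address: redis.chatbot.svc.cluster.local:6379
        listName: celery              # Celery's default queue key in Redis
        listLength: "20"              # target backlog per replica
```

Scaling on queue depth rather than CPU means the workers grow with the actual task backlog, while the API tier can scale separately on request load.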
4. Multi-Tenancy for Cost Efficiency
• Converted parts of the platform to be multi-tenant, allowing multiple customers to share underlying infrastructure.
• This approach helped distribute costs more effectively while preserving data isolation and performance.
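One common way to share a cluster while keeping each customer’s footprint bounded is a per-tenant namespace with a quota; this is a generic sketch with placeholder names, since the case study does not detail the actual tenancy model:

```yaml
# Hypothetical per-tenant setup on a shared cluster: tenants share nodes and
# platform services, but no single tenant can consume unbounded resources.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme                   # placeholder tenant name
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```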
5. Focus on LLM-Driven Processes
• Ensured that the LLM-based chatbot could seamlessly integrate newly ingested knowledge bases.
• Prepared the system for additional advanced features, such as voice conversation capabilities (an area we have experience with for other clients).
Results
• Improved Stability & Reliability: Separating services and configuring Kubernetes resources eliminated the random crashes and restarts.
• Scalable Architecture: Independent autoscaling ensures the chatbot and background workers can respond to changing workloads without over-allocating resources.
• Enhanced Observability: With real-time monitoring and APM, performance issues can be quickly detected and addressed, leading to higher uptime and better user satisfaction.
• Transparent Resource Usage & Reduced Costs: By defining resource requests and limits, the platform’s costs are more predictable. Multi-tenancy and targeted autoscaling have further optimized infrastructure usage.
Next Steps
• Performance Testing: We plan to conduct more rigorous load and stress tests to validate the platform’s capabilities under peak demand.
• Further Job Splitting: Additional segmentation of the knowledge base processing tasks will enable even finer control over resource allocation, further improving resilience.
• Ongoing Collaboration: Our team remains committed to supporting the client’s success, ensuring that the platform adapts to the evolving requirements of LLM-based use cases.