Case Study

Stabilizing and Scaling a B2B SaaS Chatbot Powered by Large Language Models

Kauz.ai specializes in creating intelligent chatbot solutions that enhance customer engagement for businesses worldwide. The company developed a chatbot that uses Large Language Models (LLMs) to provide real-time responses by drawing on knowledge bases (KBs) built from various documentation sources. The platform needed to handle document processing at scale, ensure prompt responses, and deliver a seamless user experience. However, issues with the underlying Kubernetes (k8s) setup, along with a monolithic service structure, led to recurring stability challenges.
Challenge

1. Single-Container Monolith

• The chatbot’s entire functionality (including Celery workers, queues, and application logic) ran in a single container.

• This setup hindered independent scaling and caused resource contention.

2. Improper Kubernetes Configuration

• Services were running without properly assigned resource requests and limits.

• Frequent crashes and restarts occurred unpredictably, destabilizing the platform.

3. Lack of Monitoring and APM

• No Application Performance Monitoring (APM) or structured logging was in place.

• Diagnosing performance bottlenecks or system failures was challenging.

4. Rising Costs and Limited Visibility

• Because resources were untracked and unbounded, unplanned costs began to pile up.

• There was no clear approach for scaling to meet workload demands efficiently.

What We Did

1. Service Separation and Kubernetes Best Practices

• We split the monolithic container into multiple deployments, isolating key services (application servers, Celery workers, queue managers).

• Implemented proper Kubernetes resource requests and limits to ensure predictable performance (a minimal manifest sketch follows).
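
As a minimal sketch of what such a Deployment can look like, here is an illustrative manifest for a dedicated Celery worker pool; the names, image, and resource figures are placeholders, not the client’s actual configuration:

```yaml
# Illustrative Deployment for a dedicated Celery worker pool.
# All names, the image, and the sizing below are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: celery-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: celery-worker
  template:
    metadata:
      labels:
        app: celery-worker
    spec:
      containers:
        - name: worker
          image: registry.example.com/chatbot-worker:latest
          command: ["celery", "-A", "app", "worker", "--loglevel=info"]
          resources:
            requests:          # capacity the scheduler reserves for the pod
              cpu: 500m
              memory: 512Mi
            limits:            # hard ceiling; exceeding memory triggers an OOM kill
              cpu: "1"
              memory: 1Gi
```

With requests and limits in place, the scheduler can place pods predictably, and a misbehaving worker can no longer starve the application servers sharing the same node.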

2. Introduction of Monitoring and APM

• Deployed monitoring tools and Application Performance Monitoring solutions to track system health in real time.

• Set up alerting to identify and address issues quickly, before they affected end users (an example alert rule follows).
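
The case study does not name the monitoring stack, so purely as an assumption, the rule below shows what such an alert looks like in the Prometheus rule format, using the standard kube-state-metrics restart counter; the namespace and thresholds are illustrative:

```yaml
# Assumed Prometheus-style alert rule; namespace and thresholds are illustrative.
groups:
  - name: chatbot-availability
    rules:
      - alert: PodRestartingFrequently
        # kube-state-metrics counter of container restarts, summed over 15 minutes
        expr: increase(kube_pod_container_status_restarts_total{namespace="chatbot"}[15m]) > 3
        for: 5m                # condition must hold for 5 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} restarted more than 3 times in 15 minutes"
```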

3. Autoscaling and Colocation

• Enabled task- and request-based autoscaling to scale each service independently (a sketch of one queue-based approach follows below).

• Where beneficial, used colocation of certain services to balance resource usage and reduce overhead.
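
The write-up does not say which tool drives the task-based scaling; one common way to implement it for queue workers is a KEDA ScaledObject that tracks queue depth. The broker (RabbitMQ) and queue name below are likewise our assumptions, not the client’s confirmed setup:

```yaml
# Hypothetical KEDA ScaledObject scaling the worker Deployment on queue depth.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: celery-worker-scaler
spec:
  scaleTargetRef:
    name: celery-worker            # the Deployment sketched above
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq               # assumes RabbitMQ as the Celery broker
      metadata:
        queueName: kb-ingestion    # hypothetical task queue
        mode: QueueLength
        value: "10"                # target of roughly 10 pending tasks per replica
      authenticationRef:
        name: rabbitmq-auth        # TriggerAuthentication holding the broker URL
```

Scaling on queue depth rather than CPU means the worker pool grows when documents pile up for ingestion and shrinks back to a minimum when the queues drain.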

4. Multi-Tenancy for Cost Efficiency

• Converted parts of the platform to be multi-tenant, allowing multiple customers to share underlying infrastructure.

• This approach helped distribute costs more effectively while preserving data isolation and performance (one common isolation pattern is sketched below).
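
The case study does not detail the isolation mechanism, but one widely used pattern, shown here as an assumption, is a namespace per tenant bounded by a ResourceQuota, so shared-cluster usage stays capped and attributable per customer:

```yaml
# Hypothetical per-tenant namespace with a quota; tenant name and sizes are illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-acme
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-acme-quota
  namespace: tenant-acme
spec:
  hard:
    requests.cpu: "4"         # total CPU the tenant's pods may request
    requests.memory: 8Gi
    limits.cpu: "8"           # total CPU ceiling across the tenant's pods
    limits.memory: 16Gi
```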

5. Focus on LLM-Driven Processes

• Ensured that the LLM-based chatbot could seamlessly integrate newly ingested knowledge bases.

• Prepared the system for additional advanced features, such as voice conversation capabilities (an area we have experience with for other clients).

Results

• Improved Stability & Reliability: Separating services and configuring Kubernetes resources eliminated random crashes and restarts.

• Scalable Architecture: Independent autoscaling ensures the chatbot and background workers can respond to changing workloads without over-allocating resources.

• Enhanced Observability: With real-time monitoring and APM, performance issues can be quickly detected and addressed, leading to higher uptime and better user satisfaction.

• Transparent Resource Usage & Reduced Costs: By defining resource requests and limits, the platform’s costs are more predictable. Multi-tenancy and targeted autoscaling have further optimized infrastructure usage.

Next Steps

• Performance Testing: We plan to conduct more rigorous load and stress tests to validate the platform’s capabilities under peak demand.

• Further Job Splitting: Additional segmentation of the knowledge base processing tasks will enable even finer control over resource allocation, further improving resilience.

• Ongoing Collaboration: Our team remains committed to supporting the client’s success, ensuring that the platform adapts to the evolving requirements of LLM-based use cases.

Conclusion
Why This Matters for Companies Building LLM-Based Solutions

Future-Ready Solutions

• Scalable Architecture: As AI-driven products grow, we prioritize scalability and reliability.

• Cross-Industry Impact: Principles like modularity, monitoring, and optimization apply across various platforms.

• Customer-Centric Approach: We align with business goals, addressing challenges and celebrating successes.
By rethinking a B2B SaaS chatbot's architecture, we improved stability and reduced costs, ensuring every change supports our customers’ objectives through empathy and collaboration.