March 2024 Highlights: External Health Checks for ClickHouse Using Prometheus, Switching to Spot Instances & More

Apr 05, 2024

In March, our infrastructure management team worked through 79 tasks, each aimed at improving performance, reducing costs, or strengthening security. This article covers the most significant work and highlights the team's proactive approach to managing different infrastructures for our partners.

Cloud Cost Optimization and Resource Management

Last month, one of our main goals was to cut costs for one of our clients. The team moved to spot instances, which greatly reduced operating costs. They also audited CloudWatch and removed unused metrics, logs, and alarms, and cleaned up unused EBS volumes, so the client pays only for resources that are actually in use.


Related Tasks


  1. (DMVP-1192): Transitioning to spot instances to leverage cost savings in cloud computing resources.
  2. (DMVP-3526): Reviewing CloudWatch (metrics, logs, alarms) and deleting unused resources.
  3. (DMVP-3534): Identifying and removing unused Elastic Block Store (EBS) volumes to free up resources and minimize expenses.
  4. (DMVP-3678): Investigating cost-saving options provided by AWS to optimize spending on cloud services.
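
To make the EBS cleanup step more concrete, here is a minimal Python sketch of how unattached volumes could be shortlisted for review. It is an illustration, not the tooling used for these tasks: the function name `find_unattached_volumes` and the 14-day age threshold are our assumptions, and the input is expected to have the shape of the `Volumes` list returned by the EC2 DescribeVolumes API (for example via boto3's `describe_volumes`). The sample records are made up.

```python
from datetime import datetime, timedelta, timezone


def find_unattached_volumes(volumes, min_age_days=14, now=None):
    """Return IDs of volumes that are unattached and older than min_age_days.

    `volumes` is a list of dicts shaped like the "Volumes" entries of the
    EC2 DescribeVolumes response. The age threshold is an assumption meant
    to avoid flagging volumes that were created only recently.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    candidates = []
    for vol in volumes:
        # EBS reports the state "available" for a volume that is not
        # attached to any instance.
        unattached = vol["State"] == "available" and not vol.get("Attachments")
        if unattached and vol["CreateTime"] <= cutoff:
            candidates.append(vol["VolumeId"])
    return candidates


# Made-up sample data for illustration only.
SAMPLE_VOLUMES = [
    {"VolumeId": "vol-attached", "State": "in-use",
     "CreateTime": datetime(2023, 1, 10, tzinfo=timezone.utc),
     "Attachments": [{"InstanceId": "i-example"}]},
    {"VolumeId": "vol-orphaned", "State": "available",
     "CreateTime": datetime(2023, 6, 1, tzinfo=timezone.utc),
     "Attachments": []},
    {"VolumeId": "vol-new", "State": "available",
     "CreateTime": datetime(2024, 3, 30, tzinfo=timezone.utc),
     "Attachments": []},
]

review_list = find_unattached_volumes(
    SAMPLE_VOLUMES, now=datetime(2024, 4, 1, tzinfo=timezone.utc)
)
print(review_list)
```

A list like this is best treated as input for a manual review (or a snapshot-then-delete step) rather than as an automatic deletion list, since an unattached volume may still hold data someone needs.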

Database Management and Migration

The team moved MySQL and Elasticsearch/OpenSearch databases to a new Azure environment for one of our partners, making sure the switch was smooth and did not compromise data quality. They also made important updates and improvements, such as setting up ClickHouse and improving MongoDB indexes.


Related Tasks


  1. (DMVP-2987): Finalizing the Prefect deployment
  2. (DMVP-3601): Moving an older MySQL database to a new Azure-based setup
  3. (DMVP-3602): Transferring Elasticsearch/OpenSearch databases from an AWS-based setup to a new Azure Kubernetes environment
  4. (DMVP-3603): Moving data from MongoDB to OpenSearch as part of a migration from an old OTC-Nomad setup to Azure AKS
  5. (DMVP-3652): Upgrading the MongoDB cluster
  6. (DMVP-3671): Integrating MongoDB Atlas with monitoring tools
  7. (DMVP-3680): Updating the version of the Relational Database Service (RDS)


Monitoring and System Health

For another partner, our team improved system health monitoring. Two changes were crucial: setting up external health checks and using Prometheus to track ClickHouse metrics. Together they gave us a detailed view of how the system was performing and allowed us to spot and fix problems early, before they could interrupt services.
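
As a rough illustration of what an external health check can look like, the Python sketch below probes ClickHouse's HTTP interface, whose `/ping` endpoint answers `Ok.` when the server is up. This is not the check deployed for the partner: the function name `check_clickhouse`, the 5-second timeout, and the host in the usage comment are assumptions. The `opener` parameter exists only so the check can be exercised without a live server.

```python
import urllib.request


def check_clickhouse(base_url, timeout=5.0, opener=urllib.request.urlopen):
    """Probe the ClickHouse HTTP /ping endpoint.

    Returns a (healthy, detail) tuple. `opener` defaults to
    urllib.request.urlopen and can be swapped out in tests.
    """
    url = base_url.rstrip("/") + "/ping"
    try:
        with opener(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", errors="replace").strip()
    except OSError as exc:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False, f"unreachable: {exc}"
    if status == 200 and body == "Ok.":
        return True, "ok"
    return False, f"unexpected response: {status} {body!r}"


# Example usage (hypothetical host; 8123 is ClickHouse's default HTTP port):
# healthy, detail = check_clickhouse("http://clickhouse.example.internal:8123")
```

Running a probe like this from outside the cluster is what makes the check "external": it fails when the network path or load balancer is broken, not only when the database process itself is down.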


Related Tasks


  1. (DMVP-3547): Fine-tuning alarms to better monitor latency and traffic, improving system health monitoring.
  2. (DMVP-3595): Implementing external health checks to monitor the system's status and ensure its reliability.
  3. (DMVP-3642): Using Prometheus to monitor ClickHouse metrics, providing detailed insights into database performance and health.
  4. (DMVP-3649): Troubleshooting and fixing the external health checks and monitoring systems to ensure they function correctly.
  5. (DMVP-3653): Examining the process for manually uploading training data to identify and resolve potential issues, contributing to overall system health.
  6. (DMVP-3708): Establishing alerts for ClickHouse read-only mode incidents to proactively manage and resolve database health issues.
  7. (DMVP-3712): Communicating with BI/Data Science team to gain insights into the setup, aiding in better system health and performance monitoring.
  8. (DMVP-3774): Regular infrastructure support activities to maintain system health and address any arising issues promptly.
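
To show the idea behind the read-only alert, here is a small Python sketch that reads metrics in the Prometheus text format and reports how many replicated tables are in read-only mode. It assumes ClickHouse's built-in Prometheus endpoint is enabled and exports `system.metrics` values under the `ClickHouseMetrics_` prefix, so the relevant gauge would be `ClickHouseMetrics_ReadonlyReplica`; the function names and the sample payload are ours and purely illustrative. The parser is deliberately simplified and skips labelled samples.

```python
def parse_metrics(text):
    """Parse unlabelled samples from Prometheus text exposition format.

    Returns {metric_name: float_value}. Comment lines (# HELP / # TYPE) and
    samples with labels are skipped to keep the sketch short.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            values[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return values


def readonly_replicas(text, metric="ClickHouseMetrics_ReadonlyReplica"):
    """Return the number of read-only replicated tables (0 if not reported)."""
    return int(parse_metrics(text).get(metric, 0))


# Made-up sample scrape for illustration only.
SAMPLE_SCRAPE = """\
# HELP ClickHouseMetrics_ReadonlyReplica Replicated tables in readonly state
# TYPE ClickHouseMetrics_ReadonlyReplica gauge
ClickHouseMetrics_ReadonlyReplica 2
# TYPE ClickHouseMetrics_Query gauge
ClickHouseMetrics_Query 5
"""

if readonly_replicas(SAMPLE_SCRAPE) > 0:
    print("ALERT: ClickHouse has tables in read-only mode")
```

In a Prometheus setup the same condition would more typically live in an alerting rule with an expression along the lines of `ClickHouseMetrics_ReadonlyReplica > 0`, with Alertmanager handling the notification.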


Deployment and Configuration Enhancements

Moving services over to Kubernetes and setting up new cloud environments made the deployment process smoother and faster for another partner. We also introduced autoscaling for the workers, so their system can automatically scale up or down as the workload changes. In addition, we improved the CI/CD pipelines. These changes have made their operations more reliable and allow us to manage their infrastructure in a more flexible and responsive way.


Related Tasks


  1. (DMVP-3668): Deploying Kubernetes Certificate Authority to production EKS, enhancing deployment security and configuration.
  2. (DMVP-3655): Centralizing the Docker container registry for more streamlined and efficient deployment processes.
  3. (DMVP-3656): Transitioning from GitLab to Kaniko for container builds, optimizing the deployment pipeline.
  4. (DMVP-3705): Implementing Horizontal Pod Autoscaling for workers to improve resource management during deployments.
  5. (DMVP-3704): Deploying ClickHouse in a new cloud account, part of setting up and optimizing new environments.
  6. (DMVP-3797): Deploying new services, demonstrating the ongoing efforts to enhance deployment practices and configuration management.
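
For readers curious how Horizontal Pod Autoscaling decides on a replica count, the Python sketch below reproduces the core formula from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The function name, the replica bounds, and the example numbers are assumptions for illustration; the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling decision for a single metric.

    current_metric and target_metric use the same unit, for example
    average CPU utilization in percent across the worker pods.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))


# 4 workers at 90% average CPU with a 60% target: scale out.
print(desired_replicas(4, 90, 60))
# 4 workers at 30% average CPU with a 60% target: scale in.
print(desired_replicas(4, 30, 60))
```

The `max_replicas` bound is worth choosing with care for worker pools, since it caps both the throughput during a spike and the cost of one.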


Security and Compliance

By keeping a close eye on security and regularly reviewing and improving MongoDB database setups, we made sure our client's systems stayed secure and met the required standards.


Related Tasks


  1. (DMVP-3639): Renewing and managing licenses for open-source projects to ensure compliance.
  2. (DMVP-3650): Discussing and implementing improvements in MongoDB indexes to enhance security and performance.
  3. (DMVP-3652): Upgrading MongoDB clusters for better security and efficiency.


Those were the key achievements in March for different clients.


Stay tuned for the April updates, and let us know if you would like to learn more about our cloud infrastructure management best practices.
