This document describes how Kafka should be monitored: the components involved, the metrics to watch, which of them should raise alarms, and the thresholds to use.
We may need a Prometheus exporter, but that is not decided yet; we should first settle on the metrics to monitor and then pick whichever collection method covers them with the right amount of effort.
- AWS MSK
  - Grafana queries CloudWatch directly (or via the Prometheus CloudWatch Exporter) against the `AWS/Kafka` and `AWS/KafkaTopic` namespaces (exporter config sketch below)
  - IAM role with `cloudwatch:GetMetricData` permissions
 
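If the MSK metrics are pulled through the Prometheus CloudWatch Exporter rather than queried by Grafana directly, a minimal exporter config could look like the sketch below (the region, the metric selection, and the statistics are placeholder assumptions to be extended):

```yaml
# Prometheus CloudWatch Exporter config (sketch) for MSK.
# Region and the metric list are placeholders; extend per the alert table below.
region: eu-west-1
metrics:
  - aws_namespace: AWS/Kafka
    aws_metric_name: UnderReplicatedPartitions
    aws_dimensions: [Cluster Name, Broker ID]
    aws_statistics: [Maximum]
  - aws_namespace: AWS/Kafka
    aws_metric_name: ActiveControllerCount
    aws_dimensions: [Cluster Name]
    aws_statistics: [Sum]
  - aws_namespace: AWS/KafkaTopic
    aws_metric_name: BytesInPerSec
    aws_dimensions: [Cluster Name, Broker ID, Topic]
    aws_statistics: [Average]
```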
- DigitalOcean Managed Kafka
  - Prometheus scrapes the cluster's built-in HTTPS `/metrics` endpoint using basic auth and DO's root CA (scrape-config sketch below)
  - Static `scrape_config` in `prometheus.yml` (30–60 s interval)
  - Optional JMX exporter for consumer-lag metrics
 
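A matching `prometheus.yml` scrape job for that endpoint could look roughly like this (the job name, target host/port, and file paths are placeholders; the credentials and root CA come from the DO control panel):

```yaml
scrape_configs:
  - job_name: kafka_do                # placeholder job name
    scheme: https
    metrics_path: /metrics
    scrape_interval: 60s              # within the 30-60 s range above
    basic_auth:
      username: doadmin                               # placeholder; from the DO cluster credentials
      password_file: /etc/prometheus/do-kafka.pass    # mounted secret, not inlined
    tls_config:
      ca_file: /etc/prometheus/do-root-ca.crt         # DO root CA
    static_configs:
      - targets: ["kafka-do.example.com:9273"]        # placeholder host:port
```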
- On-Prem Kafka
  - Prometheus JMX Exporter deployed alongside each broker (embedded or sidecar); see the rules-file sketch below
  - Scrape via a dedicated `job_name: kafka_onprem` in `prometheus.yml`
  - TLS and auth per your internal security policies
 
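The JMX Exporter needs a rules file to turn broker MBeans into the metric names referenced later in this page; a minimal sketch (two illustrative rules only, patterns depend on your broker version and naming conventions) might be:

```yaml
# jmx_exporter rules (sketch), run as a javaagent or sidecar on each broker.
lowercaseOutputName: true
rules:
  # kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
  - pattern: kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value
    name: kafka_server_replicamanager_underreplicatedpartitions
    type: GAUGE
  # kafka.controller:type=KafkaController,name=OfflinePartitionsCount
  - pattern: kafka.controller<type=KafkaController, name=OfflinePartitionsCount><>Value
    name: kafka_controller_kafkacontroller_offlinepartitionscount
    type: GAUGE
```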
That covers the agentless managed-service scrapes plus the JMX exporter for any self-hosted clusters.
| Priority | Alert | Metric / Expression | Condition & Duration | Adjustable Parameters |
| --- | --- | --- | --- | --- |
| 🔴 P1 | Under-replicated partitions | | > | |
| 🔴 P1 | Offline partitions | | > | |
| 🔴 P1 | Controller availability | | != | |
| 🔴 P1 | Broker process down | | == | |
| 🟠 P2 | Consumer-group lag spike | | > | |
| 🟠 P2 | Produce latency | | > | |
| 🟠 P2 | Fetch latency | | > | |
| 🟠 P2 | Unexpected partition reassignments | | > | |
| 🟡 P3 | Disk usage low | | < | |
| 🟡 P3 | Broker CPU saturation | | > | |
| 🟡 P3 | GC pause duration | | > | |
| 🟡 P3 | Network request errors | | > | |
| 🟡 P3 | JVM heap usage high | | > | |

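As a starting point, the P1 rows could be expressed as Prometheus alerting rules along the lines of the sketch below. The thresholds and `for:` durations are placeholders (the table's values are still to be agreed), and the metric names assume the standard JMX-exporter Kafka rules or their CloudWatch equivalents:

```yaml
groups:
  - name: kafka-p1
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) > 0
        for: 5m                      # placeholder duration, matches the runbook trigger below
        labels: { severity: critical }
      - alert: KafkaOfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 1m                      # placeholder
        labels: { severity: critical }
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m                      # placeholder
        labels: { severity: critical }
      - alert: KafkaBrokerDown
        expr: up{job="kafka_onprem"} == 0
        for: 2m                      # placeholder; job label differs per cluster
        labels: { severity: critical }
```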
Use the same monitoring sub-module in each of your terraform-<provider>-kafka repositories. It consumes outputs from your local Kafka module and provisions ServiceMonitors (or CloudWatch alarms), dashboards, and alerts.
##########################################
# terraform-do-kafka/monitoring.tf
##########################################
module "kafka" {
  source = "./"
  # … Kafka provisioning …
}
module "monitoring" {
  source = "./modules/monitoring"
  # Identity & placement
  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace
  # Metrics source
  metrics_provider   = "prometheus"
  metrics_endpoint   = module.kafka.metrics_endpoint
  credentials_secret = var.metrics_credentials_secret
  # Scrape & alerting
  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval
  # Thresholds
  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms
  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
##########################################
# terraform-aws-kafka/monitoring.tf
##########################################
module "kafka" {
  source = "./"
  # … MSK provisioning …
}
module "monitoring" {
  source = "./modules/monitoring"
  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace
  # Choose CloudWatch vs Prometheus+exporter
  metrics_provider           = var.msk_metrics_provider       # "cloudwatch" or "prometheus"
  cloudwatch_region          = var.aws_region
  credentials_secret         = var.metrics_credentials_secret
  # If using Prometheus, deploy exporter automatically
  enable_exporter            = var.msk_metrics_provider == "prometheus"
  exporter_kubeconfig_secret = var.exporter_kubeconfig_secret
  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval
  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms
  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
##########################################
# terraform-onprem-kafka/monitoring.tf
##########################################
module "kafka" {
  source = "./"
  # … on-prem Kafka provisioning …
}
module "monitoring" {
  source = "./modules/monitoring"
  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace
  metrics_provider   = "prometheus"
  metrics_endpoint   = var.jmx_exporter_endpoint
  credentials_secret = var.jmx_exporter_credentials_secret
  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval
  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms
  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
Notes:
- All three provider repos use the same `monitoring` sub-module signature.
- The AWS MSK variant can either hook into CloudWatch alarms or automatically deploy a Prometheus exporter. 
- Application-specific Kafka metrics (e.g., queue depths for App A/B) should live in those apps’ own dashboards alongside these shared Kafka vitals. 
Below is how you can fold the “Common Kafka Failure Scenarios” table into your runbooks and escalation procedures. First, present the table as a quick reference. Then for each critical or high-severity issue, provide a templated runbook entry.
| Issue | Severity | Description |
| --- | --- | --- |
| Broker Outages | Critical | Broker process down (crash, OOM kill, node failure) leading to unavailable partitions and data-loss risk. | 
| Under-Replicated Partitions | Critical | Followers falling out of ISR due to slow I/O or network blips, risking data durability if a broker fails. | 
| Offline Partitions | Critical | No leader for a partition (controller split-brain or ZK/KRaft hiccup), making data unavailable. | 
| Controller Unavailability | Critical | No active controller or frequent controller elections, breaking cluster coordination. | 
| Disk Full / Low Free Space | High | Log directories running out of space, causing broker crashes or halted writes. | 
| Excessive GC Pauses | High | Long JVM GC pauses blocking I/O threads, leading to client timeouts and ISR drops. | 
| Consumer-Lag Blowups | High | Consumers falling far behind (stuck or overloaded), risking stale processing and backlog growth. | 
| Partition Reassignment Storms | High | Unexpected rebalance activity causing high load and potential downtime. | 
| Network Errors & Timeouts | High | TLS/DNS/firewall issues causing request failures and interrupted replication or client communication. | 
| Schema Registry / ACL Failures | Medium | Mis-applied ACLs or registry unavailability causing authorization or serialization errors. | 
| Version-Upgrade / Rolling Restarts | Medium | Mixed-version clusters or incompatible configs leading to protocol errors during upgrades. | 
| Cross-DC / MirrorMaker Breaks | Medium | Inter-cluster replication lag or connectivity loss in MirrorMaker setups, risking data divergence. | 
For each Critical or High severity issue, create a runbook entry following this pattern:
- Precondition / Alert Trigger
  - Reference the alert name (e.g. `Under-Replicated Partitions > 0 for 5m`).

- Initial Diagnostics
  - Check alert details in Alertmanager or CloudWatch (depending on platform).
  - Identify affected broker(s) via the `cluster_name` and `instance` labels.
  - Review recent metrics (ISR count, GC pause, disk usage, consumer lag) in Grafana.

- Remediation Steps
  - Broker Outage
    - SSH into the node, or use `kubectl describe pod` / `kubectl logs`.
    - If OOM, review memory limits and JVM heap usage; consider restarting the pod or freeing OS memory.
  - Under-Replicated Partitions
    - Verify slow-follower metrics (`kafka_server_fetchstats_...`), GC pauses, and disk I/O.
    - Restart the lagging broker or rebalance partitions with `kafka-reassign-partitions.sh`.
  - Offline Partitions
    - Check controller logs (`kafka.controller.KafkaController`).
    - Force a leader election (e.g. with `kafka-leader-election.sh` on Kafka 2.4+, or `kafka-preferred-replica-election.sh` on older releases).
  - Excessive GC Pauses
    - Inspect GC logs for pause causes; adjust `-Xms`/`-Xmx` or the GC algorithm.
    - Roll brokers one by one after tuning.

- Verification
  - Confirm the ISR returns to the full set (`UnderReplicatedPartitions == 0`).
  - Ensure no further alerts for this issue in the next 10 minutes.

- Escalation
  - Immediate paging for Critical alerts.
  - If unresolved after 10 minutes, escalate to the SRE lead.
  - After 30 minutes without resolution, involve a senior Kafka engineer or the DBA team.

Severity response targets (see the Alertmanager routing sketch below):

- Critical
  - Page on-call immediately.
  - Triage expectation: acknowledge within 5 min, resolve or escalate within 15 min.

- High
  - Notify on-call via email/Slack.
  - Acknowledge within 15 min, resolve or escalate within 1 hour.

- Medium
  - Log to the ticketing system; review in the next business cycle unless recurring.
 
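To encode those response targets in routing, an Alertmanager sketch could look like this (receiver names are placeholders and their integration settings are omitted; severities are assumed to arrive as `severity` labels on the alerts):

```yaml
route:
  receiver: ticket-queue                 # default: Medium goes to the ticketing integration
  group_by: [alertname, cluster_name]
  routes:
    - matchers: ['severity = "critical"']
      receiver: pagerduty-oncall         # page on-call immediately
      repeat_interval: 15m               # re-notify if still firing, matching the 15 min window
    - matchers: ['severity = "high"']
      receiver: slack-oncall             # email/Slack notification
      repeat_interval: 1h
receivers:
  - name: pagerduty-oncall               # pagerduty_configs omitted in this sketch
  - name: slack-oncall                   # slack_configs omitted in this sketch
  - name: ticket-queue                   # ticketing webhook omitted in this sketch
```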
Define these alerts centrally (Alertmanager or CloudWatch), scoped by a service label so each team sees only their own issues.
| Alert | Expression | Severity | Action |
| --- | --- | --- | --- |
| High Processing Latency | | High | Page on-call if p99 > threshold |
| Error-Rate Spike | | High | Notify SRE for > error_rate_pct% errors |
| Queue Depth High | | Medium | Auto-scale consumers or investigate |
| DLQ Growth | | Medium | Alert team to inspect dead-letter queue |

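To illustrate the service-label scoping, the error-rate alert might be written as the sketch below (the metric names reuse the dashboard queries further down; the 5% threshold and 5 m window stand in for the still-unset `error_rate_pct` and duration):

```yaml
groups:
  - name: app-kafka-alerts
    rules:
      - alert: AppErrorRateSpike
        # The service label on the metrics propagates to the alert,
        # so Alertmanager can route it to the owning team only.
        expr: |
          increase(app_message_errors_total[5m])
            / increase(app_messages_processed_total[5m]) > 0.05
        for: 5m                          # placeholder duration
        labels:
          severity: high
        annotations:
          summary: "Error-rate spike for service {{ $labels.service }}"
```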
Add these panel groups to both your main application-stack dashboard and each service-specific dashboard.
Use a $service template variable to render one block per service:
- App Throughput (msg/s) - rate(app_messages_processed_total{service="$service"}[1m])
- App 99th-Pct Latency - histogram_quantile(0.99, sum by (le) (rate(app_processing_duration_seconds_bucket{service="$service"}[5m])))
- App Error Rate (%) - 100 * increase(app_message_errors_total{service="$service"}[5m]) / increase(app_messages_processed_total{service="$service"}[5m])
- App Queue Depth - app_queue_depth{service="$service"}
- DLQ Depth - app_dlq_depth{service="$service"}
- Consumer Lag - max by(partition)(kafka_consumergroup_lag{group=~"$service-.*"})
- Under-Replicated Partitions - sum(kafka_controller_replicamanager_underreplicatedpartitions{topic=~"$service-.*"})
- Produce Errors - increase(kafka_network_requestmetrics_requests_total{request="Produce", error!=""}[5m])
Arrange panels in two rows—Application Metrics on top, Kafka-Infra Metrics below—and keep a consistent grid layout for each service block.
For each service’s own dashboard (no selector needed), include the same eight panels, pre-filtered to that service:
- Throughput 
- 99th-Pct Latency 
- Error Rate 
- Queue Depth 
- DLQ Depth 
- Consumer Lag 
- Under-Replicated Partitions 
- Produce Errors 
Group panels into two sections—application KPIs first, then Kafka infra—so service owners get both business and platform insights at a glance!