This document describes how Kafka should be monitored: the components involved, the metrics to watch, which of those metrics should raise alarms, and the thresholds to use. We may need the Prometheus exporter, but that is not yet decided; we should first agree on the metrics to monitor and then see which collection method covers them best.
**AWS MSK**
- Grafana queries CloudWatch directly (or via the Prometheus CloudWatch Exporter) against the `AWS/Kafka` and `AWS/KafkaTopic` namespaces.
- Requires an IAM role with `cloudwatch:GetMetricData` permissions.

**DigitalOcean Managed Kafka**
- Prometheus scrapes the cluster’s built-in HTTPS `/metrics` endpoint using basic auth and DO’s root CA.
- Static `scrape_config` in `prometheus.yml` (30–60 s interval).
- Optional JMX exporter for consumer-lag metrics.

**On-Prem Kafka**
- Prometheus JMX Exporter deployed alongside each broker (embedded javaagent or sidecar).
- Scraped via a dedicated `job_name: kafka_onprem` in `prometheus.yml`.
- TLS and auth per your internal security policies.

That covers the agentless managed-service scrapes plus the JMX exporter for any self-hosted clusters.
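As a sketch, the two Prometheus scrape jobs might look like this in `prometheus.yml` (hostnames, ports, job names, and credential paths are placeholders, not the real cluster values):

```yaml
scrape_configs:
  # DigitalOcean Managed Kafka: agentless scrape of the built-in /metrics endpoint.
  - job_name: kafka_do
    scheme: https
    metrics_path: /metrics
    scrape_interval: 60s                     # 30-60 s is sufficient here
    basic_auth:
      username: do_metrics_user              # placeholder; from the DO control panel
      password_file: /etc/prometheus/do_kafka_password
    tls_config:
      ca_file: /etc/prometheus/do_root_ca.crt  # DO's root CA
    static_configs:
      - targets: ["example-kafka-do.db.ondigitalocean.com:9273"]  # placeholder

  # On-prem brokers: JMX exporter embedded in or sidecar'd with each broker.
  - job_name: kafka_onprem
    scheme: https
    scrape_interval: 30s
    tls_config:
      ca_file: /etc/prometheus/internal_ca.crt  # per internal security policy
    static_configs:
      - targets: ["broker1.internal:7071", "broker2.internal:7071"]  # placeholders
```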
Metric names below follow the JMX-exporter naming used in the dashboard queries later in this document; durations and percentages are suggested starting points, tunable via the monitoring-module variables.

| Priority | Alert | Metric / Expression | Condition & Duration | Adjustable Parameters |
| --- | --- | --- | --- | --- |
| 🔴 P1 | Under-replicated partitions | `sum(kafka_controller_replicamanager_underreplicatedpartitions)` | > `under_replicated_threshold` (default 0) for 5m | `under_replicated_threshold`, duration |
| 🔴 P1 | Offline partitions | `kafka_controller_kafkacontroller_offlinepartitionscount` | > 0 for 1m | duration |
| 🔴 P1 | Controller availability | `sum(kafka_controller_kafkacontroller_activecontrollercount)` | != 1 for 1m | duration |
| 🔴 P1 | Broker process down | `up{job=~"kafka.*"}` | == 0 for 2m | scrape job selector, duration |
| 🟠 P2 | Consumer-group lag spike | `kafka_consumergroup_lag` | > `consumer_lag_threshold` for 10m | `consumer_lag_threshold`, duration |
| 🟠 P2 | Produce latency | p99 of `kafka_network_requestmetrics_totaltimems{request="Produce"}` | > `produce_latency_ms` for 10m | `produce_latency_ms`, percentile |
| 🟠 P2 | Fetch latency | p99 of `kafka_network_requestmetrics_totaltimems{request="Fetch"}` | > `fetch_latency_ms` for 10m | `fetch_latency_ms`, percentile |
| 🟠 P2 | Unexpected partition reassignments | reassigning-partitions count | > 0 outside planned maintenance | maintenance window |
| 🟡 P3 | Disk usage low | free space on log dirs (`node_filesystem_avail_bytes / node_filesystem_size_bytes`) | < 20% for 15m | free-space % |
| 🟡 P3 | Broker CPU saturation | broker CPU utilization | > 80% for 15m | CPU % |
| 🟡 P3 | GC pause duration | `increase(jvm_gc_collection_seconds_sum[1m])` | > 1 s for 5m | pause threshold |
| 🟡 P3 | Network request errors | `rate(kafka_network_requestmetrics_requests_total{error!=""}[5m])` | > 0 for 5m | error-rate threshold |
| 🟡 P3 | JVM heap usage high | `jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"}` | > 85% for 15m | heap % |
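As one concrete encoding, a Prometheus alerting-rule sketch for two of the P1 alerts. The under-replicated metric name matches the dashboard queries later in this document; the broker-down job selector and durations are assumptions to tune locally:

```yaml
groups:
  - name: kafka-p1
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_controller_replicamanager_underreplicatedpartitions) > 0
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "{{ $value }} under-replicated partitions"
      - alert: KafkaBrokerDown
        # Assumes broker scrape jobs are named kafka_do / kafka_onprem / similar.
        expr: up{job=~"kafka.*"} == 0
        for: 2m
        labels:
          severity: P1
        annotations:
          summary: "Broker {{ $labels.instance }} is not responding to scrapes"
```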
Use the same `monitoring` sub-module in each of your `terraform-<provider>-kafka` repositories. It consumes outputs from your local Kafka module and provisions ServiceMonitors (or CloudWatch alarms), dashboards, and alerts.
```hcl
##########################################
# terraform-do-kafka/monitoring.tf
##########################################
module "kafka" {
  source = "./"
  # … Kafka provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  # Identity & placement
  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  # Metrics source
  metrics_provider   = "prometheus"
  metrics_endpoint   = module.kafka.metrics_endpoint
  credentials_secret = var.metrics_credentials_secret

  # Scrape & alerting
  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  # Thresholds
  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
```
```hcl
##########################################
# terraform-aws-kafka/monitoring.tf
##########################################
module "kafka" {
  source = "./"
  # … MSK provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  # Choose CloudWatch vs Prometheus+exporter
  metrics_provider   = var.msk_metrics_provider # "cloudwatch" or "prometheus"
  cloudwatch_region  = var.aws_region
  credentials_secret = var.metrics_credentials_secret

  # If using Prometheus, deploy exporter automatically
  enable_exporter            = var.msk_metrics_provider == "prometheus"
  exporter_kubeconfig_secret = var.exporter_kubeconfig_secret

  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
```
```hcl
##########################################
# terraform-onprem-kafka/monitoring.tf
##########################################
module "kafka" {
  source = "./"
  # … on-prem Kafka provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  metrics_provider   = "prometheus"
  metrics_endpoint   = var.jmx_exporter_endpoint
  credentials_secret = var.jmx_exporter_credentials_secret

  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
```
Notes:
- All three provider repos use the same `monitoring` sub-module signature.
- The AWS MSK variant can either hook into CloudWatch alarms or automatically deploy a Prometheus exporter.
- Application-specific Kafka metrics (e.g., queue depths for App A/B) should live in those apps’ own dashboards alongside these shared Kafka vitals.
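The shared signature above implies a `variables.tf` in the `monitoring` sub-module along these lines. This is a partial sketch; names come from the module calls above, but the types and defaults are illustrative, not the module's actual definitions:

```hcl
# modules/monitoring/variables.tf (partial sketch)

variable "cluster_name" {
  type = string
}

variable "metrics_provider" {
  type        = string
  description = "prometheus or cloudwatch"
  default     = "prometheus"
}

variable "under_replicated_threshold" {
  type    = number
  default = 0 # alert on any under-replicated partition
}

variable "consumer_lag_threshold" {
  type    = number
  default = 10000 # messages; tune per workload
}

variable "enable_dashboards" {
  type    = bool
  default = true
}

variable "enable_alerts" {
  type    = bool
  default = true
}
```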
Below is how you can fold the “Common Kafka Failure Scenarios” table into your runbooks and escalation procedures. First, present the table as a quick reference. Then for each critical or high-severity issue, provide a templated runbook entry.
| Issue | Severity | Description |
| --- | --- | --- |
Broker Outages | Critical | Broker process down (crash, OOM kill, node failure) leading to unavailable partitions and data-loss risk. |
Under-Replicated Partitions | Critical | Followers falling out of ISR due to slow I/O or network blips, risking data durability if a broker fails. |
Offline Partitions | Critical | No leader for a partition (controller split-brain or ZK/KRaft hiccup), making data unavailable. |
Controller Unavailability | Critical | No active controller or frequent controller elections, breaking cluster coordination. |
Disk Full / Low Free Space | High | Log directories running out of space, causing broker crashes or halted writes. |
Excessive GC Pauses | High | Long JVM GC pauses blocking I/O threads, leading to client timeouts and ISR drops. |
Consumer-Lag Blowups | High | Consumers falling far behind (stuck or overloaded), risking stale processing and backlog growth. |
Partition Reassignment Storms | High | Unexpected rebalance activity causing high load and potential downtime. |
Network Errors & Timeouts | High | TLS/DNS/firewall issues causing request failures and interrupted replication or client communication. |
Schema Registry / ACL Failures | Medium | Mis-applied ACLs or registry unavailability causing authorization or serialization errors. |
Version-Upgrade / Rolling Restarts | Medium | Mixed-version clusters or incompatible configs leading to protocol errors during upgrades. |
Cross-DC / MirrorMaker Breaks | Medium | Inter-cluster replication lag or connectivity loss in MirrorMaker setups, risking data divergence. |
For each Critical or High severity issue, create a runbook entry following this pattern:
**Precondition / Alert Trigger**
- Reference the alert name (e.g. `Under-Replicated Partitions > 0 for 5m`).
**Initial Diagnostics**
- Check alert details in Alertmanager or CloudWatch (depending on platform).
- Identify the affected broker(s) via the `cluster_name` and `instance` labels.
- Review recent metrics (ISR count, GC pauses, disk usage, consumer lag) in Grafana.
**Remediation Steps**

**Broker Outage**
- SSH into the node, or use `kubectl describe pod` / `kubectl logs`.
- If OOM-killed, review memory limits and JVM heap usage; consider restarting the pod or freeing OS memory.
**Under-Replicated Partitions**
- Verify slow-follower metrics (`kafka_server_fetchstats_...`), GC pauses, and disk I/O.
- Restart the lagging broker or rebalance partitions with `kafka-reassign-partitions.sh`.
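A hedged example of the rebalance step; the topic name, broker IDs, and bootstrap address are placeholders, and the JSON plan should come from capacity planning, not be copied verbatim:

```shell
# reassign.json: move partition 0 of "orders" onto brokers 2,3,4 (placeholders)
cat > reassign.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "orders", "partition": 0, "replicas": [2, 3, 4] }
  ]
}
EOF

# Start the reassignment, then poll until it completes
kafka-reassign-partitions.sh --bootstrap-server broker1.internal:9092 \
  --reassignment-json-file reassign.json --execute
kafka-reassign-partitions.sh --bootstrap-server broker1.internal:9092 \
  --reassignment-json-file reassign.json --verify
```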
**Offline Partitions**
- Check controller logs (`kafka.controller.KafkaController`).
- Force a leader election:
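On Kafka 2.4+, a preferred-leader election can be triggered with the bundled CLI (the bootstrap address is a placeholder; scope to specific topics/partitions instead of `--all-topic-partitions` where possible):

```shell
kafka-leader-election.sh --bootstrap-server broker1.internal:9092 \
  --election-type PREFERRED --all-topic-partitions
```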
**Excessive GC Pauses**
- Inspect GC logs for pause causes; adjust `-Xms`/`-Xmx` or the GC algorithm.
- Roll brokers one by one after tuning.
**Verification**
- Confirm the ISR returns to the full replica set (`UnderReplicatedPartitions == 0`).
- Ensure no further alerts fire for this issue in the next 10 minutes.
**Escalation**
- Immediate paging for Critical alerts.
- If unresolved after 10 minutes, escalate to the SRE lead.
- After 30 minutes without resolution, involve a senior Kafka engineer or the DBA team.
**Critical**
- Page on-call immediately.
- Triage expectation: acknowledge within 5 min; resolve or escalate within 15 min.

**High**
- Notify on-call via email/Slack.
- Acknowledge within 15 min; resolve or escalate within 1 hour.

**Medium**
- Log to the ticketing system; review in the next business cycle unless recurring.
Define these alerts centrally (Alertmanager or CloudWatch), scoped by a `service` label so each team sees only its own issues.
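A sketch of the corresponding Alertmanager routing, assuming hypothetical receiver names and that every alert carries `severity` and `service` labels:

```yaml
route:
  receiver: default
  group_by: [alertname, cluster_name, service]
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty-oncall   # Critical: page immediately
      repeat_interval: 5m
    - matchers: ['severity="high"']
      receiver: slack-sre          # High: notify via Slack/email
    - matchers: ['service="app-a"'] # per-team scoping example
      receiver: team-a-slack

receivers:
  - name: default
  - name: pagerduty-oncall
  - name: slack-sre
  - name: team-a-slack
```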
| Alert | Expression | Severity | Action |
| --- | --- | --- | --- |
| High Processing Latency | `histogram_quantile(0.99, sum by (le) (rate(app_processing_duration_seconds_bucket[5m])))` | High | Page on-call if p99 > threshold |
| Error-Rate Spike | `increase(app_message_errors_total[5m]) / increase(app_messages_processed_total[5m])` | High | Notify SRE for > error_rate_pct% errors |
| Queue Depth High | `app_queue_depth` | Medium | Auto-scale consumers or investigate |
| DLQ Growth | `delta(app_dlq_depth[15m])` | Medium | Alert team to inspect dead-letter queue |
Add these panel groups to both your main application-stack dashboard and each service-specific dashboard. Use a `$service` template variable to render one block per service:
- **App Throughput (msg/s)**: `rate(app_messages_processed_total{service="$service"}[1m])`
- **App 99th-Pct Latency**: `histogram_quantile(0.99, sum by (le) (rate(app_processing_duration_seconds_bucket{service="$service"}[5m])))`
- **App Error Rate (%)**: `100 * increase(app_message_errors_total{service="$service"}[5m]) / increase(app_messages_processed_total{service="$service"}[5m])`
- **App Queue Depth**: `app_queue_depth{service="$service"}`
- **DLQ Depth**: `app_dlq_depth{service="$service"}`
- **Consumer Lag**: `max by(partition)(kafka_consumergroup_lag{group=~"$service-.*"})`
- **Under-Replicated Partitions**: `sum(kafka_controller_replicamanager_underreplicatedpartitions{topic=~"$service-.*"})`
- **Produce Errors**: `increase(kafka_network_requestmetrics_requests_total{request="Produce", error!=""}[5m])`
Arrange panels in two rows—Application Metrics on top, Kafka-Infra Metrics below—and keep a consistent grid layout for each service block.
For each service’s own dashboard (no selector needed), include the same eight panels, pre-filtered to that service:
Throughput
99th-Pct Latency
Error Rate
Queue Depth
DLQ Depth
Consumer Lag
Under-Replicated Partitions
Produce Errors
Group panels into two sections—application KPIs first, then Kafka infra—so service owners get both business and platform insights at a glance!