Kafka Monitoring (managed)

This document describes how Kafka should be monitored: the components involved, the metrics to watch, which of them should raise alarms, and the thresholds to apply.

1. The Setup

We may need a Prometheus exporter, but that is not decided yet; the plan is to first agree on the metrics to monitor and then pick the collection method that covers them.

  • AWS MSK

    • Grafana queries CloudWatch directly (or via the Prometheus CloudWatch Exporter) against the AWS/Kafka and AWS/KafkaTopic namespaces

    • IAM role with cloudwatch:GetMetricData permissions

  • DigitalOcean Managed Kafka

    • Prometheus scrapes the cluster’s built-in HTTPS /metrics endpoint using basic-auth and DO’s root-CA

    • Static scrape_config in prometheus.yml (30–60 s interval); a sample config follows this list

    • Optional JMX exporter for consumer-lag metrics

  • On-Prem Kafka

    • Prometheus JMX Exporter deployed alongside each broker (embedded or sidecar)

    • Scrape via a dedicated job_name: kafka_onprem in prometheus.yml

    • TLS and auth per your internal security policies

That covers the agentless managed-service scrapes plus the JMX exporter for any self-hosted clusters.
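
For the DigitalOcean case, a minimal scrape_config sketch might look like the following, assuming Prometheus reads the basic-auth credentials and DO root CA from local files; the job name, file paths, and target host/port are placeholders taken from the cluster's connection details, not confirmed values:

# prometheus.yml (excerpt): hypothetical job name, paths, and target
scrape_configs:
  - job_name: kafka_do_managed
    scheme: https
    metrics_path: /metrics
    scrape_interval: 60s
    basic_auth:
      username: <do_metrics_username>                       # from the DO control panel
      password_file: /etc/prometheus/secrets/do_kafka_pass  # assumed secret mount
    tls_config:
      ca_file: /etc/prometheus/secrets/do_root_ca.crt       # DO root CA
    static_configs:
      - targets: ["<metrics-host>:<metrics-port>"]          # cluster metrics endpoint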

2. Critical Metrics to be Monitored

| Priority | Alert | Metric / Expression | Condition & Duration | Adjustable Parameters |
|---|---|---|---|---|
| 🔴 P1 | Under-replicated partitions | kafka_controller_replicamanager_underreplicatedpartitions | > 0 for > 5m | <replica_threshold>=0, <replica_duration>=5m |
| 🔴 P1 | Offline partitions | kafka_controller_kafkacontroller_offlinepartitionscount | > 0 immediately | <offline_threshold>=0 |
| 🔴 P1 | Controller availability | kafka_controller_kafkacontroller_activecontrollercount | != 1 for > 1m | <controller_expected>=1, <controller_duration>=1m |
| 🔴 P1 | Broker process down | up{job="kafka_onprem"} | == 0 for > 2m | <job_name>, <down_duration>=2m |
| 🟠 P2 | Consumer-group lag spike | max by(group) (kafka_consumergroup_lag{group!=""}) | > {{lag_threshold}} for > 2m | <lag_threshold>, <lag_duration>=2m |
| 🟠 P2 | Produce latency | histogram_quantile(0.99, kafka_network_requestmetrics_total_time_ms{request="Produce"}[5m]) | > {{produce_latency_ms}} ms | <produce_quantile>=0.99, <produce_window>=5m, <produce_ms> |
| 🟠 P2 | Fetch latency | histogram_quantile(0.99, kafka_network_requestmetrics_total_time_ms{request=~"Fetch.*"}[5m]) | > {{fetch_latency_ms}} ms | <fetch_quantile>=0.99, <fetch_window>=5m, <fetch_ms> |
| 🟠 P2 | Unexpected partition reassignments | increase(kafka_controller_partitionreassignments_oldpartitioncount[10m]) | > 0 outside maintenance | <reassign_window>=10m |
| 🟡 P3 | Low disk free space | kafka_log_log_dir_free_bytes / kafka_log_log_dir_total_bytes | < 0.15 | <disk_free_ratio>=0.15 |
| 🟡 P3 | Broker CPU saturation | rate(process_cpu_seconds_total[5m]) / <cpu_cores> | > 0.90 for > 5m | <cpu_cores>, <cpu_threshold>=0.90, <cpu_window>=5m |
| 🟡 P3 | GC pause duration | jvm_gc_pause_seconds_max | > {{gc_pause_s}} | <gc_pause_s> |
| 🟡 P3 | Network request errors | increase(kafka_network_requestmetrics_errors_total[5m]) | > 0 in 5m | <net_error_window>=5m |
| 🟡 P3 | JVM heap usage high | jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} | > 0.85 for > 5m | <heap_threshold>=0.85, <heap_window>=5m |
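
To show how a table row becomes an alert definition, here is a minimal Prometheus rules sketch for the first two P1 rows; the group and alert names, severity labels, and annotation text are assumptions, while the expressions and durations come straight from the table:

# kafka-alerts.yml (sketch): substitute the adjustable parameters per environment
groups:
  - name: kafka-critical
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_controller_replicamanager_underreplicatedpartitions > 0   # <replica_threshold>
        for: 5m                                                                # <replica_duration>
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0      # <offline_threshold>
        for: 0m                                                                # fire immediately
        labels:
          severity: critical
        annotations:
          summary: "Offline partitions reported by {{ $labels.instance }}"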

3. Monitoring Sub-Module Invocation

Use the same monitoring sub-module in each of your terraform-<provider>-kafka repositories. It consumes outputs from your local Kafka module and provisions ServiceMonitors (or CloudWatch alarms), dashboards, and alerts.

##########################################
# terraform-do-kafka/monitoring.tf
##########################################

module "kafka" {
  source = "./"
  # … Kafka provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  # Identity & placement
  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  # Metrics source
  metrics_provider   = "prometheus"
  metrics_endpoint   = module.kafka.metrics_endpoint
  credentials_secret = var.metrics_credentials_secret

  # Scrape & alerting
  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  # Thresholds
  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
##########################################
# terraform-aws-kafka/monitoring.tf
##########################################

module "kafka" {
  source = "./"
  # … MSK provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  # Choose CloudWatch vs Prometheus+exporter
  metrics_provider           = var.msk_metrics_provider       # "cloudwatch" or "prometheus"
  cloudwatch_region          = var.aws_region
  credentials_secret         = var.metrics_credentials_secret

  # If using Prometheus, deploy exporter automatically
  enable_exporter            = var.msk_metrics_provider == "prometheus"
  exporter_kubeconfig_secret = var.exporter_kubeconfig_secret

  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
##########################################
# terraform-onprem-kafka/monitoring.tf
##########################################

module "kafka" {
  source = "./"
  # … on-prem Kafka provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  metrics_provider   = "prometheus"
  metrics_endpoint   = var.jmx_exporter_endpoint
  credentials_secret = var.jmx_exporter_credentials_secret

  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}

Notes:

  • All three provider repos use the same monitoring sub-module signature.

  • The AWS MSK variant can either hook into CloudWatch alarms or automatically deploy a Prometheus exporter (a sample exporter config follows these notes).

  • Application-specific Kafka metrics (e.g., queue depths for App A/B) should live in those apps’ own dashboards alongside these shared Kafka vitals.
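
If the MSK variant uses the Prometheus CloudWatch Exporter, a minimal exporter config sketch could look like this; the region is a placeholder, the two metrics shown are only examples, and the dimension names should be verified against the AWS/Kafka namespace:

# cloudwatch_exporter config (sketch): extend the metric list as needed
region: <aws_region>
metrics:
  - aws_namespace: AWS/Kafka
    aws_metric_name: UnderReplicatedPartitions
    aws_dimensions: [Cluster Name, Broker ID]
    aws_statistics: [Average]
  - aws_namespace: AWS/Kafka
    aws_metric_name: CpuUser
    aws_dimensions: [Cluster Name, Broker ID]
    aws_statistics: [Average]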

4. On-Call Runbooks & Escalation

The “Common Kafka Failure Scenarios” table below folds into the runbooks and escalation procedures: present it first as a quick reference, then provide a templated runbook entry for each Critical or High severity issue.

4.1 Quick Reference Table

| Issue | Severity | Description |
|---|---|---|
| Broker Outages | Critical | Broker process down (crash, OOM kill, node failure) leading to unavailable partitions and data-loss risk. |
| Under-Replicated Partitions | Critical | Followers falling out of ISR due to slow I/O or network blips, risking data durability if a broker fails. |
| Offline Partitions | Critical | No leader for a partition (controller split-brain or ZK/KRaft hiccup), making data unavailable. |
| Controller Unavailability | Critical | No active controller or frequent controller elections, breaking cluster coordination. |
| Disk Full / Low Free Space | High | Log directories running out of space, causing broker crashes or halted writes. |
| Excessive GC Pauses | High | Long JVM GC pauses blocking I/O threads, leading to client timeouts and ISR drops. |
| Consumer-Lag Blowups | High | Consumers falling far behind (stuck or overloaded), risking stale processing and backlog growth. |
| Partition Reassignment Storms | High | Unexpected rebalance activity causing high load and potential downtime. |
| Network Errors & Timeouts | High | TLS/DNS/firewall issues causing request failures and interrupted replication or client communication. |
| Schema Registry / ACL Failures | Medium | Mis-applied ACLs or registry unavailability causing authorization or serialization errors. |
| Version-Upgrade / Rolling Restarts | Medium | Mixed-version clusters or incompatible configs leading to protocol errors during upgrades. |
| Cross-DC / MirrorMaker Breaks | Medium | Inter-cluster replication lag or connectivity loss in MirrorMaker setups, risking data divergence. |

4.2 Runbook Templates

For each Critical or High severity issue, create a runbook entry following this pattern:

Runbook: Handle <Issue Name>
  • Precondition / Alert Trigger

    • Reference the alert name (e.g. Under-Replicated Partitions > 0 for 5m).

  • Initial Diagnostics

    1. Check alert details in Alertmanager or CloudWatch (depending on platform).

    2. Identify affected broker(s) via cluster_name and instance labels.

    3. Review recent metrics (ISR count, GC pause, disk usage, consumer lag) in Grafana.

  • Remediation Steps

    1. Broker Outage

      • SSH into the node or kubectl describe pod / kubectl logs.

      • If OOM, review memory limits and JVM heap usage; consider restarting the pod or freeing OS memory.

    2. Under-Replicated Partitions

      • Verify slow follower metrics (kafka_server_fetchstats_...), GC pauses, disk I/O.

      • Restart lagging broker or rebalance partitions with kafka-reassign-partitions.sh.

    3. Offline Partitions

      • Check controller logs (kafka.controller.KafkaController).

      • Force a leader election, e.g. with kafka-leader-election.sh --election-type preferred on the affected topic-partitions.

    4. Excessive GC Pauses

      • Inspect GC logs for pause causes; adjust -Xms/-Xmx or GC algorithm.

      • Roll brokers one by one after tuning.

  • Verification

    • Confirm ISR returns to full set (UnderReplicatedPartitions == 0).

    • Ensure no further alerts for this issue in the next 10 minutes.

  • Escalation

    • Immediate paging for Critical alerts.

    • If unresolved after 10 minutes, escalate to SRE lead.

    • After 30 minutes without resolution, involve senior Kafka engineer or DBA team.


4.3 Escalation Policy
  1. Critical

    • Page on-call immediately.

    • Triage expectation: acknowledge within 5 min, resolve or escalate within 15 min.

  2. High

    • Notify on-call via email/Slack.

    • Acknowledge within 15 min, resolve or escalate within 1 hour.

  3. Medium

    • Log to ticketing system; review in next business cycle unless recurring.
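
A minimal Alertmanager routing sketch that maps these tiers onto notification channels, assuming alerts carry a severity label as in the rule sketches above; the receiver names are placeholders and the actual PagerDuty/Slack/ticketing integrations are omitted:

# alertmanager.yml (excerpt): receiver integrations left empty on purpose
route:
  receiver: ticketing                  # default: Medium goes to the ticket queue
  group_by: [alertname, cluster_name]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall       # page immediately
      repeat_interval: 15m             # keep paging while unresolved
    - matchers:
        - severity="high"
      receiver: slack-oncall           # email/Slack notification
      repeat_interval: 1h
receivers:
  - name: pagerduty-oncall
  - name: slack-oncall
  - name: ticketing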

5. Application-Level Alerts

Define these alerts centrally (Alertmanager or CloudWatch), scoped by a service label so each team sees only their own issues.

| Alert | Expression | Severity | Action |
|---|---|---|---|
| High Processing Latency | histogram_quantile(0.99, rate(app_processing_duration_seconds_bucket{service="$service"}[5m])) > {{latency_threshold_s}} | High | Page on-call if p99 > threshold |
| Error-Rate Spike | increase(app_message_errors_total{service="$service"}[5m]) / increase(app_messages_processed_total{service="$service"}[5m]) > {{error_rate_pct}} | High | Notify SRE for > error_rate_pct% errors |
| Queue Depth High | app_queue_depth{service="$service"} > {{queue_depth_threshold}} | Medium | Auto-scale consumers or investigate |
| DLQ Growth | increase(app_dlq_depth{service="$service"}[10m]) > {{dlq_growth_threshold}} | Medium | Alert team to inspect dead-letter queue |

Template Variables: {{latency_threshold_s}}, {{error_rate_pct}}, {{queue_depth_threshold}}, {{dlq_growth_threshold}} — set these per service when the alert rules are instantiated.
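
As a sketch of how one of these alerts could be defined centrally with a per-service scope, assuming Prometheus rules and the app_* metric names from the table; the rule name, the service name "payments", and the numeric threshold are purely illustrative:

# app-alerts.yml (sketch): one rule instance per service and threshold
groups:
  - name: app-kafka-alerts
    rules:
      - alert: AppQueueDepthHigh
        expr: app_queue_depth{service="payments"} > 10000   # {{queue_depth_threshold}}, hypothetical value
        for: 5m
        labels:
          severity: medium
          service: payments            # scoping label so each team sees only its own alerts
        annotations:
          summary: "Queue depth high for {{ $labels.service }}"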


6. Dashboard Widgets

Add these panel groups to both your main application-stack dashboard and each service-specific dashboard.

6.1 Main Application-Stack Dashboard

Use a $service template variable to render one block per service (a sketch of the variable definition follows the panel list):

  1. App Throughput (msg/s): rate(app_messages_processed_total{service="$service"}[1m])

  2. App 99th-Pct Latency: histogram_quantile(0.99, rate(app_processing_duration_seconds_bucket{service="$service"}[5m]))

  3. App Error Rate (%): increase(app_message_errors_total{service="$service"}[5m]) / increase(app_messages_processed_total{service="$service"}[5m])

  4. App Queue Depth: app_queue_depth{service="$service"}

  5. DLQ Depth: app_dlq_depth{service="$service"}

  6. Consumer Lag: max by(partition)(kafka_consumergroup_lag{group=~"$service-.*"})

  7. Under-Replicated Partitions: sum(kafka_controller_replicamanager_underreplicatedpartitions{topic=~"$service-.*"})

  8. Produce Errors: increase(kafka_network_requestmetrics_requests_total{request="Produce", error!=""}[5m])

Arrange panels in two rows—Application Metrics on top, Kafka-Infra Metrics below—and keep a consistent grid layout for each service block.
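
A sketch of how the $service variable could be defined, assuming the app_messages_processed_total metric carries a service label; Grafana stores dashboards as JSON, so the templating block is shown here in YAML form only for readability:

# dashboard templating block (illustrative field names)
templating:
  list:
    - name: service
      type: query
      query: label_values(app_messages_processed_total, service)  # Prometheus datasource query
      includeAll: true
      multi: true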

6.2 Service-Specific Dashboard

For each service’s own dashboard (no selector needed), include the same eight panels, pre-filtered to that service:

  • Throughput

  • 99th-Pct Latency

  • Error Rate

  • Queue Depth

  • DLQ Depth

  • Consumer Lag

  • Under-Replicated Partitions

  • Produce Errors

Group panels into two sections—application KPIs first, then Kafka infra—so service owners get both business and platform insights at a glance!