Kafka Monitoring (managed)

This document describes how Kafka should be monitored: the components involved, the metrics to watch, which of them should raise alarms, and the thresholds to apply.

1. The Setup

We may need a Prometheus exporter, but that is not decided yet; the plan is to first agree on the metrics to monitor and then pick the collection method that covers them.

  • AWS MSK

    • Grafana queries CloudWatch directly (or via the Prometheus CloudWatch Exporter) against the AWS/Kafka and AWS/KafkaTopic namespaces

    • IAM role with cloudwatch:GetMetricData permissions

  • DigitalOcean Managed Kafka

    • Prometheus scrapes the cluster’s built-in HTTPS /metrics endpoint using basic-auth and DO’s root-CA

    • Static scrape_config in prometheus.yml (30–60 s interval); a sample config follows this list

    • Optional JMX exporter for consumer-lag metrics

  • On-Prem Kafka

    • Prometheus JMX Exporter deployed alongside each broker (embedded or sidecar)

    • Scrape via a dedicated job_name: kafka_onprem in prometheus.yml

    • TLS and auth per your internal security policies

That covers the agentless managed-service scrapes plus the JMX exporter for any self-hosted clusters.
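
For the DigitalOcean case, a minimal scrape_config sketch might look like the following, assuming Prometheus reads the basic-auth credentials and DO root CA from local files; the job name, file paths, and target host/port are placeholders taken from the cluster's connection details, not confirmed values:

# prometheus.yml (excerpt): hypothetical job name, paths, and target
scrape_configs:
  - job_name: kafka_do_managed
    scheme: https
    metrics_path: /metrics
    scrape_interval: 60s
    basic_auth:
      username: <do_metrics_username>                       # from the DO control panel
      password_file: /etc/prometheus/secrets/do_kafka_pass  # assumed secret mount
    tls_config:
      ca_file: /etc/prometheus/secrets/do_root_ca.crt       # DO root CA
    static_configs:
      - targets: ["<metrics-host>:<metrics-port>"]          # cluster metrics endpoint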

2. Critical Metrics to be Monitored

| Priority | Alert | Metric / Expression | Condition & Duration | Adjustable Parameters |
|---|---|---|---|---|
| 🔴 P1 | Under-replicated partitions | kafka_controller_replicamanager_underreplicatedpartitions | > 0 for > 5m | <replica_threshold>=0, <replica_duration>=5m |
| 🔴 P1 | Offline partitions | kafka_controller_kafkacontroller_offlinepartitionscount | > 0 immediately | <offline_threshold>=0 |
| 🔴 P1 | Controller availability | kafka_controller_kafkacontroller_activecontrollercount | != 1 for > 1m | <controller_expected>=1, <controller_duration>=1m |
| 🔴 P1 | Broker process down | up{job="kafka_onprem"} | == 0 for > 2m | <job_name>, <down_duration>=2m |
| 🟠 P2 | Consumer-group lag spike | max by(group) (kafka_consumergroup_lag{group!=""}) | > {{lag_threshold}} for > 2m | <lag_threshold>, <lag_duration>=2m |
| 🟠 P2 | Produce latency | histogram_quantile(0.99, kafka_network_requestmetrics_total_time_ms{request="Produce"}[5m]) | > {{produce_latency_ms}} ms | <produce_quantile>=0.99, <produce_window>=5m, <produce_ms> |
| 🟠 P2 | Fetch latency | histogram_quantile(0.99, kafka_network_requestmetrics_total_time_ms{request=~"Fetch.*"}[5m]) | > {{fetch_latency_ms}} ms | <fetch_quantile>=0.99, <fetch_window>=5m, <fetch_ms> |
| 🟠 P2 | Unexpected partition reassignments | increase(kafka_controller_partitionreassignments_oldpartitioncount[10m]) | > 0 outside maintenance | <reassign_window>=10m |
| 🟡 P3 | Low disk free space | kafka_log_log_dir_free_bytes / kafka_log_log_dir_total_bytes | < 0.15 | <disk_free_ratio>=0.15 |
| 🟡 P3 | Broker CPU saturation | rate(process_cpu_seconds_total[5m]) / <cpu_cores> | > 0.90 for > 5m | <cpu_cores>, <cpu_threshold>=0.90, <cpu_window>=5m |
| 🟡 P3 | GC pause duration | jvm_gc_pause_seconds_max | > {{gc_pause_s}} | <gc_pause_s> |
| 🟡 P3 | Network request errors | increase(kafka_network_requestmetrics_errors_total[5m]) | > 0 in 5m | <net_error_window>=5m |
| 🟡 P3 | JVM heap usage high | jvm_memory_bytes_used{area="heap"} / jvm_memory_bytes_max{area="heap"} | > 0.85 for > 5m | <heap_threshold>=0.85, <heap_window>=5m |
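
To show how a table row becomes an alert definition, here is a minimal Prometheus rules sketch for the first two P1 rows; the group and alert names, severity labels, and annotation text are assumptions, while the expressions and durations come straight from the table:

# kafka-alerts.yml (sketch): substitute the adjustable parameters per environment
groups:
  - name: kafka-critical
    rules:
      - alert: KafkaUnderReplicatedPartitions
        expr: kafka_controller_replicamanager_underreplicatedpartitions > 0   # <replica_threshold>
        for: 5m                                                                # <replica_duration>
        labels:
          severity: critical
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"
      - alert: KafkaOfflinePartitions
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0      # <offline_threshold>
        for: 0m                                                                # fire immediately
        labels:
          severity: critical
        annotations:
          summary: "Offline partitions reported by {{ $labels.instance }}"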

3. Monitoring Sub-Module Invocation

Use the same monitoring sub-module in each of your terraform-<provider>-kafka repositories. It consumes outputs from your local Kafka module and provisions ServiceMonitors (or CloudWatch alarms), dashboards, and alerts.

##########################################
# terraform-do-kafka/monitoring.tf
##########################################

module "kafka" {
  source = "./"
  # … Kafka provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  # Identity & placement
  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  # Metrics source
  metrics_provider   = "prometheus"
  metrics_endpoint   = module.kafka.metrics_endpoint
  credentials_secret = var.metrics_credentials_secret

  # Scrape & alerting
  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  # Thresholds
  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
##########################################
# terraform-aws-kafka/monitoring.tf
##########################################

module "kafka" {
  source = "./"
  # … MSK provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  # Choose CloudWatch vs Prometheus+exporter
  metrics_provider           = var.msk_metrics_provider       # "cloudwatch" or "prometheus"
  cloudwatch_region          = var.aws_region
  credentials_secret         = var.metrics_credentials_secret

  # If using Prometheus, deploy exporter automatically
  enable_exporter            = var.msk_metrics_provider == "prometheus"
  exporter_kubeconfig_secret = var.exporter_kubeconfig_secret

  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}
##########################################
# terraform-onprem-kafka/monitoring.tf
##########################################

module "kafka" {
  source = "./"
  # … on-prem Kafka provisioning …
}

module "monitoring" {
  source = "./modules/monitoring"

  cluster_name = module.kafka.name
  namespace    = var.monitoring_namespace

  metrics_provider   = "prometheus"
  metrics_endpoint   = var.jmx_exporter_endpoint
  credentials_secret = var.jmx_exporter_credentials_secret

  scrape_interval           = var.scrape_interval
  alert_evaluation_interval = var.alert_evaluation_interval

  under_replicated_threshold = var.under_replicated_threshold
  consumer_lag_threshold     = var.consumer_lag_threshold
  produce_latency_ms         = var.produce_latency_ms
  fetch_latency_ms           = var.fetch_latency_ms

  enable_dashboards = var.enable_dashboards
  enable_alerts     = var.enable_alerts
}

Notes:

  • All three provider repos use the same monitoring sub-module signature.

  • The AWS MSK variant can either hook into CloudWatch alarms or automatically deploy a Prometheus exporter (a sample exporter config follows these notes).

  • Application-specific Kafka metrics (e.g., queue depths for App A/B) should live in those apps’ own dashboards alongside these shared Kafka vitals.
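
If the MSK variant uses the Prometheus CloudWatch Exporter, a minimal exporter config sketch could look like this; the region is a placeholder, the two metrics shown are only examples, and the dimension names should be verified against the AWS/Kafka namespace:

# cloudwatch_exporter config (sketch): extend the metric list as needed
region: <aws_region>
metrics:
  - aws_namespace: AWS/Kafka
    aws_metric_name: UnderReplicatedPartitions
    aws_dimensions: [Cluster Name, Broker ID]
    aws_statistics: [Average]
  - aws_namespace: AWS/Kafka
    aws_metric_name: CpuUser
    aws_dimensions: [Cluster Name, Broker ID]
    aws_statistics: [Average]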

4. On-Call Runbooks & Escalation

The “Common Kafka Failure Scenarios” table below folds into the runbooks and escalation procedures: present it first as a quick reference, then provide a templated runbook entry for each Critical or High severity issue.

4.1 Quick Reference Table

| Issue | Severity | Description |
|---|---|---|
| Broker Outages | Critical | Broker process down (crash, OOM kill, node failure) leading to unavailable partitions and data-loss risk. |
| Under-Replicated Partitions | Critical | Followers falling out of ISR due to slow I/O or network blips, risking data durability if a broker fails. |
| Offline Partitions | Critical | No leader for a partition (controller split-brain or ZK/KRaft hiccup), making data unavailable. |
| Controller Unavailability | Critical | No active controller or frequent controller elections, breaking cluster coordination. |
| Disk Full / Low Free Space | High | Log directories running out of space, causing broker crashes or halted writes. |
| Excessive GC Pauses | High | Long JVM GC pauses blocking I/O threads, leading to client timeouts and ISR drops. |
| Consumer-Lag Blowups | High | Consumers falling far behind (stuck or overloaded), risking stale processing and backlog growth. |
| Partition Reassignment Storms | High | Unexpected rebalance activity causing high load and potential downtime. |
| Network Errors & Timeouts | High | TLS/DNS/firewall issues causing request failures and interrupted replication or client communication. |
| Schema Registry / ACL Failures | Medium | Mis-applied ACLs or registry unavailability causing authorization or serialization errors. |
| Version-Upgrade / Rolling Restarts | Medium | Mixed-version clusters or incompatible configs leading to protocol errors during upgrades. |
| Cross-DC / MirrorMaker Breaks | Medium | Inter-cluster replication lag or connectivity loss in MirrorMaker setups, risking data divergence. |

4.2 Runbook Templates

For each Critical or High severity issue, create a runbook entry following this pattern:

Runbook: Handle <Issue Name>
  • Precondition / Alert Trigger

    • Reference the alert name (e.g. Under-Replicated Partitions > 0 for 5m).

  • Initial Diagnostics

    1. Check alert details in Alertmanager or CloudWatch (depending on platform).

    2. Identify affected broker(s) via cluster_name and instance labels.

    3. Review recent metrics (ISR count, GC pause, disk usage, consumer lag) in Grafana.

  • Remediation Steps

    1. Broker Outage

      • SSH into the node or kubectl describe pod / kubectl logs.

      • If OOM, review memory limits and JVM heap usage; consider restarting the pod or freeing OS memory.

    2. Under-Replicated Partitions

      • Verify slow follower metrics (kafka_server_fetchstats_...), GC pauses, disk I/O.

      • Restart lagging broker or rebalance partitions with kafka-reassign-partitions.sh.

    3. Offline Partitions

      • Check controller logs (kafka.controller.KafkaController).

      • Force a leader election, e.g. with kafka-leader-election.sh --election-type preferred on the affected topic-partitions.

    4. Excessive GC Pauses

      • Inspect GC logs for pause causes; adjust -Xms/-Xmx or GC algorithm.

      • Roll brokers one by one after tuning.

  • Verification

    • Confirm ISR returns to full set (UnderReplicatedPartitions == 0).

    • Ensure no further alerts for this issue in the next 10 minutes.

  • Escalation

    • Immediate paging for Critical alerts.

    • If unresolved after 10 minutes, escalate to SRE lead.

    • After 30 minutes without resolution, involve senior Kafka engineer or DBA team.


4.3 Escalation Policy
  1. Critical

    • Page on-call immediately.

    • Triage expectation: acknowledge within 5 min, resolve or escalate within 15 min.

  2. High

    • Notify on-call via email/Slack.

    • Acknowledge within 15 min, resolve or escalate within 1 hour.

  3. Medium

    • Log to ticketing system; review in next business cycle unless recurring.
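
A minimal Alertmanager routing sketch that maps these tiers onto notification channels, assuming alerts carry a severity label as in the rule sketches above; the receiver names are placeholders and the actual PagerDuty/Slack/ticketing integrations are omitted:

# alertmanager.yml (excerpt): receiver integrations left empty on purpose
route:
  receiver: ticketing                  # default: Medium goes to the ticket queue
  group_by: [alertname, cluster_name]
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall       # page immediately
      repeat_interval: 15m             # keep paging while unresolved
    - matchers:
        - severity="high"
      receiver: slack-oncall           # email/Slack notification
      repeat_interval: 1h
receivers:
  - name: pagerduty-oncall
  - name: slack-oncall
  - name: ticketing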

5. Application-Level Alerts

Define these alerts centrally (Alertmanager or CloudWatch), scoped by a service label so each team sees only their own issues.

| Alert | Expression | Severity | Action |
|---|---|---|---|
| High Processing Latency | histogram_quantile(0.99, rate(app_processing_duration_seconds_bucket{service="$service"}[5m])) > {{latency_threshold_s}} | High | Page on-call if p99 > threshold |
| Error-Rate Spike | increase(app_message_errors_total{service="$service"}[5m]) / increase(app_messages_processed_total{service="$service"}[5m]) > {{error_rate_pct}} | High | Notify SRE for > error_rate_pct% errors |
| Queue Depth High | app_queue_depth{service="$service"} > {{queue_depth_threshold}} | Medium | Auto-scale consumers or investigate |
| DLQ Growth | increase(app_dlq_depth{service="$service"}[10m]) > {{dlq_growth_threshold}} | Medium | Alert team to inspect dead-letter queue |

Template Variables: {{latency_threshold_s}}, {{error_rate_pct}}, {{queue_depth_threshold}}, {{dlq_growth_threshold}} — set these per service when the alert rules are instantiated.
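
As a sketch of how one of these alerts could be defined centrally with a per-service scope, assuming Prometheus rules and the app_* metric names from the table; the rule name, the service name "payments", and the numeric threshold are purely illustrative:

# app-alerts.yml (sketch): one rule instance per service and threshold
groups:
  - name: app-kafka-alerts
    rules:
      - alert: AppQueueDepthHigh
        expr: app_queue_depth{service="payments"} > 10000   # {{queue_depth_threshold}}, hypothetical value
        for: 5m
        labels:
          severity: medium
          service: payments            # scoping label so each team sees only its own alerts
        annotations:
          summary: "Queue depth high for {{ $labels.service }}"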


6. Dashboard Widgets

Add these panel groups to both your main application-stack dashboard and each service-specific dashboard.

6.1 Main Application-Stack Dashboard

Use a $service template variable to render one block per service (a sketch of the variable definition follows the panel list):

  1. App Throughput (msg/s): rate(app_messages_processed_total{service="$service"}[1m])

  2. App 99th-Pct Latency: histogram_quantile(0.99, rate(app_processing_duration_seconds_bucket{service="$service"}[5m]))

  3. App Error Rate (%): increase(app_message_errors_total{service="$service"}[5m]) / increase(app_messages_processed_total{service="$service"}[5m])

  4. App Queue Depth: app_queue_depth{service="$service"}

  5. DLQ Depth: app_dlq_depth{service="$service"}

  6. Consumer Lag: max by(partition)(kafka_consumergroup_lag{group=~"$service-.*"})

  7. Under-Replicated Partitions: sum(kafka_controller_replicamanager_underreplicatedpartitions{topic=~"$service-.*"})

  8. Produce Errors: increase(kafka_network_requestmetrics_requests_total{request="Produce", error!=""}[5m])

Arrange panels in two rows—Application Metrics on top, Kafka-Infra Metrics below—and keep a consistent grid layout for each service block.
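
A sketch of how the $service variable could be defined, assuming the app_messages_processed_total metric carries a service label; Grafana stores dashboards as JSON, so the templating block is shown here in YAML form only for readability:

# dashboard templating block (illustrative field names)
templating:
  list:
    - name: service
      type: query
      query: label_values(app_messages_processed_total, service)  # Prometheus datasource query
      includeAll: true
      multi: true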

6.2 Service-Specific Dashboard

For each service’s own dashboard (no selector needed), include the same eight panels, pre-filtered to that service:

  • Throughput

  • 99th-Pct Latency

  • Error Rate

  • Queue Depth

  • DLQ Depth

  • Consumer Lag

  • Under-Replicated Partitions

  • Produce Errors

Group panels into two sections—application KPIs first, then Kafka infra—so service owners get both business and platform insights at a glance!