Architecture Decision Record: Exposing PHP‑FPM Metrics to Prometheus in Kubernetes

Status: Proposed – for team discussion

Date: 2025‑08‑05


1. Context
  • We run several PHP‑FPM workloads in Kubernetes but lack visibility into worker utilisation, queue depth, and slow‑request counts.

  • Prometheus is our standard telemetry backend; Grafana dashboards and alert rules expect Prometheus‑formatted metrics.

  • Constraints:

    • No application‑level changes – we cannot add libraries or modify PHP code.

    • Limited control over existing Helm charts – major spec edits are difficult to upstream.

    • Operational overhead should stay low; team is wary of maintaining many small custom components.

  • Desired outcome: surface PHP‑FPM pool metrics with minimal extra footprint while preserving per‑pod isolation and future maintainability.


2. Decision Drivers
  1. Ease of rollout (hours vs days).

  2. Operational simplicity – upgrades, troubleshooting, security patches.

  3. Performance overhead – CPU/RAM and label cardinality.

  4. Observability quality – per‑pod granularity, clear ownership of metrics.

  5. Security posture – avoid exposing raw status endpoints cluster‑wide.


3. Considered Options

| Option | Summary | Score* |
| --- | --- | --- |
| A. Sidecar exporter (hipages/php‑fpm_exporter) | Add one lightweight Go binary beside PHP‑FPM container; scrape /status locally and expose /metrics on :9253 | 8 |
| B. Node‑level DaemonSet exporter | Run one exporter per node; auto‑discover sockets or TCP ports | 5 |
| C. Nginx / NJS inline transform | Use existing Nginx reverse‑proxy container to subrequest /status and re‑emit Prometheus text | 7 |
| D. Direct PodMonitor scrape | Point Prometheus at /status directly (requires FPM to emit Prometheus format) | 3 |
| E. Mutating‑webhook to inject exporter | Kyverno/OPA patch at admission time | 6 |

*Preliminary 1–10 ranking against decision drivers.


4. Decision (Draft)

Adopt Option A – Sidecar exporter for new PHP‑FPM workloads.

Rationale (a minimal pod‑spec sketch follows this list):

  • Keeps metrics and app lifecycle tightly coupled – scales with replicas.

  • Simple to reason about; widely adopted community image; negligible overhead (<10 MiB RAM, ~1 % CPU).

  • Avoids cross‑pod traffic and security exposure of raw /status pages.
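
As a rough illustration, the sidecar can be wired in along the following lines. This is a minimal sketch, not a final spec: the container names, image tag, resource figures, and scrape URI are assumptions, and it presumes the FPM pool enables pm.status_path = /status and listens on 127.0.0.1:9000.

```yaml
# Sketch only – names, tags, addresses, and resource figures are placeholders to adapt per chart.
spec:
  containers:
    - name: php-fpm
      image: registry.example.com/php-fpm-app:latest   # existing application container (placeholder)
    - name: php-fpm-exporter
      image: hipages/php-fpm_exporter:latest           # replace :latest with a pinned release tag in CI/CD
      args:
        - "server"
        - "--phpfpm.scrape-uri=tcp://127.0.0.1:9000/status"  # assumes pm.status_path = /status on the pool
      ports:
        - name: metrics
          containerPort: 9253                          # exporter's default listen port
      resources:
        requests: { cpu: 10m, memory: 16Mi }
        limits: { cpu: 50m, memory: 32Mi }
```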

Exceptions / Edge Cases
  • Legacy Helm charts we can’t modify quickly may temporarily run Option C (Nginx transform) to avoid new containers.

  • Very high‑density clusters could pilot Option B if sidecar count becomes a resource concern.


5. Consequences
  • Positive: Unified metrics pipeline; faster root‑cause analysis; HPA tuning based on accurate concurrency data.

  • Negative: One extra container image per pod; CI/CD pipeline must pin the exporter tag and track its CVEs.

  • Operational tasks:

    • Patch Helm values to add the exporter container and a Service port named metrics.

    • Create a corresponding ServiceMonitor with a 30 s scrape interval and 15 s timeout (see the sketch after this list).

    • Add runbook entry runbook/php-fpm.md describing key metrics and alert rules.
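
For reference, a ServiceMonitor matching those settings could look roughly like the following. The metadata name, namespace, and selector labels are placeholders; it assumes the Prometheus Operator CRDs are installed and that the workload's Service exposes the exporter under a port named metrics.

```yaml
# Sketch only – names, namespaces, and labels are assumptions to align with the real chart.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: php-fpm-exporter
  labels:
    release: prometheus                        # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: php-fpm-app      # placeholder: match the workload's Service labels
  namespaceSelector:
    matchNames:
      - php-apps                               # placeholder namespace
  endpoints:
    - port: metrics                            # Service port added alongside the exporter container
      path: /metrics
      interval: 30s
      scrapeTimeout: 15s
```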


6. Open Questions
  1. Can we standardise Helm chart overlays across teams to minimise drift?

  2. Should we bundle exporter config into our base Helm library chart?

  3. What alert thresholds (e.g., active/total > 0.8 for 5 min) make sense for our traffic patterns? A draft rule is sketched below for discussion.
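
To anchor question 3, one possible starting point is sketched below as a PrometheusRule. The metric names follow the hipages exporter's naming conventions; the 0.8 ratio and 5‑minute window simply mirror the example above and are not agreed values.

```yaml
# Draft for discussion only – threshold, duration, and labels are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: php-fpm-alerts
spec:
  groups:
    - name: php-fpm
      rules:
        - alert: PhpFpmWorkersSaturated
          # Active workers close to the configured pool size for a sustained period.
          expr: phpfpm_active_processes / phpfpm_total_processes > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "PHP-FPM pool on {{ $labels.pod }} has been over 80% busy for 5 minutes"
```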


Prepared by: DevOps Guild – draft for review

Comments & revisions welcome before 2025‑08‑12 review meeting