Severity → P1–P4 / Critical–Info
Component → The thing affected (service, DB, ALB, deployment, certificate, etc.)
Symptom → What’s wrong (deployment failing, 5xx error, latency spike, SSL expiring)
Levels → How far the metric deviates from its limit, e.g.:
- Threshold breached → 1.2 out of allowed 1.0 requests/sec
- Replica failure → 0 available out of 3 replicas
- Error ratio → 12% errors out of max 5% allowed
- Latency → p95 latency 420ms out of SLA 250ms
Location → Where it happens (account/project/deployment/env/region)
Impact → Human effect (users can’t log in, requests failing, infra degraded)
Owner/Team → Who acts on it
Next Step / Runbook → What to do
Links → Dashboards, logs, runbooks, silence/ack
Chart (if available)
Original Message
| Order | Field | Purpose / Description | Example | Include in Alert? |
|---|---|---|---|---|
| 1 | Severity | Urgency of alert (P1–P4 / Critical–Info) | | ✅ Yes (header) |
| 2 | Service/Component | What is affected (service, endpoint, deployment) | | ✅ Yes (header) |
| 3 | Symptom/Issue | Short description of what’s wrong | | ✅ Yes (header) |
| 4 | Environment/Scope | Where it happens (cluster, env, region) | | ✅ Yes |
| 5 | Impact | Human-readable effect (user/system impact) | | ✅ Yes |
| 6 | Owner/Team | Who should take action | | ✅ Yes |
| 7 | Next Step / Runbook | Quick guidance or link to runbook | | ✅ Yes |
| 8 | Dashboard Link | Direct link to metrics panel | Grafana/CloudWatch/DataDog URL | ✅ Yes |
| 9 | Logs / Traces Link | Shortcut to logs or error traces | Loki/Kibana/Sentry | ✅ Yes (if available) |
| 10 | Chart (snapshot) | Inline chart (if source supports) | CW chart / Grafana panel snapshot | ✅ Optional but valuable |
| 11 | Source / Silence / Ack Links | Original tool link, silence, acknowledge | Grafana silence URL / OpsGenie ack | ✅ Yes |
| 12 | Original Message | Raw alert for debugging | Raw Grafana/CloudWatch text | ✅ Yes (collapsed or after summary) |
| 13 | Metadata / Tags | Extra labels for filtering/search | | ❌ Not in main alert (keep hidden/metadata) |
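The fields above can also be carried as structured data before they are rendered into a message. A minimal sketch in Python; the class and attribute names are illustrative, not an existing schema:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StandardAlert:
    """Structured carrier for the fields in the table above (names are illustrative)."""
    severity: str                 # 1: P1-P4 / Critical-Info
    component: str                # 2: service, endpoint, deployment
    symptom: str                  # 3: short description of what's wrong
    scope: str                    # 4: cluster / env / region, e.g. "AWS:spm-ingress-alerts:prod"
    impact: str                   # 5: human-readable user/system effect
    owner: str                    # 6: team or person who should act
    next_step: str                # 7: first action or runbook link
    dashboard_url: Optional[str] = None          # 8: direct link to the metrics panel
    logs_url: Optional[str] = None               # 9: logs or error traces, if available
    chart_url: Optional[str] = None              # 10: chart snapshot, if the source supports it
    action_links: list[str] = field(default_factory=list)  # 11: source / silence / ack links
    original_message: str = ""                   # 12: raw alert, shown collapsed after the summary
    tags: dict[str, str] = field(default_factory=dict)     # 13: metadata, kept out of the main alert
```

Rows 1–7 map to required attributes; rows 8–13 default to empty values so the same structure still works when a source provides no chart or log link.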
[Icon Severity] P{prio}: "{symptom}" on {project} / {component} / {resource}
📅 Date: {timestamp in UTC}
📊 State: {FIRING | RESOLVED}
👤 Owner: {team/person responsible}
🌍 Location: {provider}:{account}:{env}
📝 Description:
- Start with what was detected (metric + threshold breached).
- Add how it was evaluated (X of Y checks, duration if relevant).
- Provide system/service context (which component/resource is impacted,
what part of infra or app is showing the problem).
- Example: "Grafana detected that 5xx error rate on lineup-service endpoint:/lineup
exceeded threshold (1.2 vs 1.0 errors/sec, 3 of 5 checks over 5m).
This indicates instability in the backend or ingress."
⚠️ Impact:
- Translate technical failure → user/business effect.
- Include severity of effect (how many users, which functions, % of requests,
or if it’s only infra-level but no visible user effect).
- Be concrete, not generic.
- Example: "~10% of user requests to lineup-service are failing with 5xx errors,
causing partial disruption in production. Login and checkout flows are unaffected."
🔧 Next step: {clear first action or runbook link}
🔗 Links: [Dashboard] [Logs] [Silence/Ack]
📡 Source: {tool/system name}
--- Original ---
{raw alert message}
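A minimal sketch of how the template above could be filled in programmatically. Nothing here is an existing implementation; the helper, the TEMPLATE constant, and the placeholder names are assumptions that simply mirror the layout above, and the priority-to-icon mapping (🔴 P1, 🟠 P2, 🟡 P3, 🔵 P4) follows the worked examples further below:

```python
# Sketch only: fills the standardized layout from plain string values.
PRIORITY_ICONS = {"P1": "🔴", "P2": "🟠", "P3": "🟡", "P4": "🔵"}

TEMPLATE = """\
{icon} {priority}: {issue} on {component} / {resource} / {project}
📅 Date: {date_utc}
📊 State: {state}
👤 Owner: {owner}
🌍 Location: {location}
📝 Description:
{description}
⚠️ Impact:
{impact}
🔧 Next step: {next_step}
🔗 Links: {links}
📡 Source: {source}
--- Original ---
{original}"""


def render_alert(**fields: str) -> str:
    """Render the standardized message; expects the placeholder names used in TEMPLATE."""
    icon = PRIORITY_ICONS.get(fields.get("priority", ""), "⚪")
    return TEMPLATE.format(icon=icon, **fields)


# Example, using the lineup-service case described later in this document:
message = render_alert(
    priority="P2",
    issue="Error rate too high (1.2 of 1.0 errors/sec)",
    component="lineup-service",
    resource="endpoint:/lineup",
    project="spm-ingress-alerts",
    date_utc="2025-08-26 12:58 UTC",
    state="FIRING",
    owner="Das Meta On-Call Team",
    location="AWS:spm-ingress-alerts:prod",
    description="Grafana detected that the 5xx error rate on lineup-service "
                "endpoint:/lineup exceeded the threshold (1.2 vs 1.0 errors/sec, "
                "3 of 5 checks over 5m).",
    impact="~10% of user requests to lineup-service are failing, causing "
           "partial production disruption.",
    next_step="Check ingress logs and service backend for failures",
    links="[Dashboard] [Silence]",
    source="Grafana via OpsGenie",
    original="[FIRING:1] High 5xx Error Rate: Lineup Endpoint ...",
)
print(message)
```

The same layout could equally be produced by a notification tool's own templating; the sketch only makes the field-to-line mapping explicit.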
You are helping me generate a standardized alert message for on-call use.
Use the following template and structure exactly:
{PrioIcon} P{Prio}: {Issue phrase + (Levels)} on {Component} / {Resource} / {Project}
📅 Date: {timestamp in UTC}
📊 State: {FIRING | RESOLVED}
👤 Owner: {team/person responsible}
🌍 Location: {provider}:{account}:{env}
📝 Description:
{Write 2-4 sentences. Start with what was detected (metric + threshold breached).
Include evaluation info (X of Y checks, duration).
Add system/service context (which component/resource is showing problem).}
⚠️ Impact:
{Write 1-3 sentences. Translate technical failure → user/business effect.
Be specific: % of requests failing, which endpoints affected, how many replicas down,
or state clearly if it’s infra-only with no direct user effect.}
🔧 Next step: {first troubleshooting action or runbook link}
🔗 Links: [Dashboard] [Logs] [Silence/Ack]
📡 Source: {tool/system name}
--- Original ---
{raw alert message}
Now, here are the details of the case:
- Priority: P2
- Component: lineup-service
- Resource: endpoint:/lineup
- Project/Deployment: spm-ingress-alerts
- Metric: 5xx error rate
- Threshold: 1.0 errors/sec
- Current: 1.2 errors/sec
- Evaluation: 3 of 5 checks, over 5m
- Impact: ~10% of user requests failing with 5xx errors in production
- Owner: Das Meta On-Call Team
- Source: Grafana via OpsGenie
- Links: [Dashboard], [Silence]
- Original: [FIRING:1] High 5xx Error Rate: Lineup Endpoint ...
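A minimal sketch (illustrative only, not an existing helper) of how such case details could be folded into the prompt above before it is sent to a model:

```python
# Sketch only: assembles the prompt above from a dict of case details.
# PROMPT_HEADER abbreviates the instruction text; in practice the full template
# block shown above would be passed in as `template_block` verbatim.
PROMPT_HEADER = (
    "You are helping me generate a standardized alert message for on-call use.\n"
    "Use the following template and structure exactly:\n"
)

case = {
    "Priority": "P2",
    "Component": "lineup-service",
    "Resource": "endpoint:/lineup",
    "Project/Deployment": "spm-ingress-alerts",
    "Metric": "5xx error rate",
    "Threshold": "1.0 errors/sec",
    "Current": "1.2 errors/sec",
    "Evaluation": "3 of 5 checks, over 5m",
    "Impact": "~10% of user requests failing with 5xx errors in production",
    "Owner": "Das Meta On-Call Team",
    "Source": "Grafana via OpsGenie",
    "Links": "[Dashboard], [Silence]",
    "Original": "[FIRING:1] High 5xx Error Rate: Lineup Endpoint ...",
}


def build_prompt(template_block: str, case: dict[str, str]) -> str:
    """Concatenate instruction, template, and the bulleted case details."""
    details = "\n".join(f"- {key}: {value}" for key, value in case.items())
    return (
        f"{PROMPT_HEADER}"
        f"{template_block}\n"
        f"Now, here are the details of the case:\n"
        f"{details}"
    )
```

The filled-out messages that follow show the kind of output expected; the 🟠 P2 lineup-service message further below corresponds to this particular case.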
🟡 P3: Replica count too low (0 of 3) on agentic-workflow-backend / deployment / agentic-workflow
📅 Date: 2025-08-26 14:32 UTC
📊 State: FIRING
👤 Owner: Backend Team
🌍 Location: AWS:kauz-prod:Robert
📝 Description:
Grafana detected that the deployment agentic-workflow-backend
has insufficient replicas running in the Robert cluster.
Condition was breached: 0 replicas available vs 3 expected,
sustained for more than 5m.
⚠️ Impact:
The agentic-workflow-backend is not serving traffic, causing request
failures in the agentic-workflow project. Users depending on this
backend may face service unavailability.
🔧 Next step: Check the rollout with `kubectl describe deployment agentic-workflow-backend`
🔗 Links: [Dashboard] [Silence]
📡 Source: Grafana
--- Original ---
[FIRING:1] Deployment Failing Alert ...

🟠 P2: Request volume too high (52k of 50k req/min) on alb-external-prod / loadbalancer / prod
📅 Date: 2025-08-26 13:15 UTC
📊 State: FIRING
👤 Owner: Infra Team
🌍 Location: AWS:970404667933:prod
📝 Description:
CloudWatch detected a sudden surge in request count on the production ALB.
Condition was breached: 52,000 requests/min vs 50,000 allowed,
observed in 4 of 5 checks over 10m.
⚠️ Impact:
The load balancer is near or above designed capacity.
If backend scaling does not keep up, users may experience
slower responses or request failures.
🔧 Next step: Validate backend auto-scaling and check error rates
🔗 Links: [CloudWatch Chart]
📡 Source: AWS CloudWatch
--- Original ---
❌ ALARM: "Performance Threshold Alert for Prod ALB (excl_5xx_4xx)" ...

🔵 P4: SSL certificate renewed successfully (90 of 90 days) on managed-kauz.net / ssl-certificate / prod
📅 Date: 2025-08-25 13:02 UTC
📊 State: RESOLVED
👤 Owner: Infra Team
🌍 Location: AWS:kauz-prod:prod
📝 Description:
Updown.io detected that the SSL certificate for managed-kauz.net
was renewed.
Condition checked: new cert validity 90 of 90 days,
old cert replaced on all endpoints.
⚠️ Impact:
No user disruption. Secure connections remain valid.
If duplicate notifications appear, some servers may still serve the old cert.
🔧 Next step: Verify all nodes present the new certificate
🔗 Links: [Updown Panel]
📡 Source: Updown.io
--- Original ---
SSL certificate renewed for wasserhygiene.managed-kauz.net ...

🔴 P1: Service failing with 504s (5 of 5 checks) on sandbox-barcode / url-check / sandbox-services
📅 Date: 2025-08-26 09:47 UTC
📊 State: FIRING
👤 Owner: Sandbox Team
🌍 Location: AWS:sandbox-services:sandbox
📝 Description:
Betterstack detected repeated 504 Gateway Timeout responses
from sandbox-barcode.
Condition observed: 5 of 5 health checks returned 504 errors
over 5m.
⚠️ Impact:
Sandbox BarCode service is fully unavailable.
All users accessing this endpoint are affected.
🔧 Next step: Check backend availability and restart the service if required
🔗 Links: [Acknowledge]
📡 Source: Betterstack
--- Original ---
Monitor: Sandbox - BarCode | Cause: Status 504 ...

🟠 P2: Error rate too high (1.2 of 1.0 errors/sec) on lineup-service / endpoint:/lineup / spm-ingress-alerts
📅 Date: 2025-08-26 12:58 UTC
📊 State: FIRING
👤 Owner: Das Meta On-Call Team
🌍 Location: AWS:spm-ingress-alerts:prod
📝 Description:
Grafana alert detected that the 5xx error rate on lineup-service
endpoint:/lineup exceeded the defined threshold.
Condition: 1.2 errors/sec vs 1.0 allowed,
3 of 5 checks over 5m.
⚠️ Impact:
~10% of user requests to lineup-service are failing,
causing partial production disruption.
🔧 Next step: Check ingress logs and service backend for failures
🔗 Links: [Dashboard] [Silence]
📡 Source: Grafana via OpsGenie
--- Original ---
[FIRING:1] High 5xx Error Rate: Lineup Endpoint ...

🟠 P2: Unexpected TypeErrors detected (62 of 0 expected) on yii2-website / controller:DefaultController.php / web-prod
📅 Date: 2025-08-26 03:27 UTC
📊 State: FIRING
👤 Owner: Suggested Assignee → Philipp R.
🌍 Location: AWS:web-prod:prod
📝 Description:
Sentry captured multiple TypeError exceptions in yii2-website.
Condition: 62 events observed vs 0 expected in the last 10m.
Code attempted to assign a string to MembershipTeamForm::$interval (int expected).
⚠️ Impact:
Pricing updates may fail for all users of the site.
Errors are ongoing and could affect subscription renewals
and membership management.
🔧 Next step: Fix type mismatch in MembershipTeamForm::$interval
🔗 Links: [Sentry Issue] [Repo]
📡 Source: Sentry
--- Original ---
TypeError /var/www/html/common/modules/premium/src/controllers/DefaultController.php ...