Alert Formatting to be used in all places (via code)
Terminology
  • Severity → P1–P4 / Critical–Info

  • Component → The thing affected (service, DB, ALB, deployment, certificate, etc.)

  • Symptom → What’s wrong (deployment failing, 5xx error, latency spike, SSL expiring)

  • Levels → How far the current value deviates from the allowed/expected value (see the sketch after this list)

    • Threshold breached → 1.2 out of allowed 1.0 requests/sec

    • Replica failure → 0 available out of 3 replicas

    • Error ratio → 12% errors out of max 5% allowed

    • Latency → p95 latency 420ms out of SLA 250ms

  • Location → Where it happens (account/project/deployment/env/region)

  • Impact → Human effect (users can’t log in, requests failing, infra degraded)

  • Owner/Team → Who acts on it

  • Next Step / Runbook → What to do

  • Links → Dashboards, logs, runbooks, silence/ack

  • Chart (if available)

  • Original Message
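
The "Levels" wording above follows one fixed pattern: current value, allowed/expected value, unit. A minimal sketch of a hypothetical helper that produces such phrases in code (the function name and signature are illustrative, not part of any existing tooling):

```python
def level_phrase(current: float, allowed: float, unit: str) -> str:
    """Render a human-friendly level, e.g. '1.2 of 1.0 errors/sec'."""
    # %g drops trailing '.0', so replica counts read as '0 of 3', not '0.0 of 3.0'.
    return f"{current:g} of {allowed:g} {unit}".strip()

# Matches the bullets above:
#   level_phrase(1.2, 1.0, "requests/sec")  -> "1.2 of 1.0 requests/sec"
#   level_phrase(0, 3, "replicas")          -> "0 of 3 replicas"
#   level_phrase(420, 250, "ms (p95)")      -> "420 of 250 ms (p95)"
```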

| Order | Field | Purpose / Description | Example | Include in Alert? |
|---|---|---|---|---|
| 1 | Severity | Urgency of alert (P1–P4 / Critical–Info) | P1 Critical | ✅ Yes (header) |
| 2 | Service/Component | What is affected (service, endpoint, deployment) | agentic-workflow-backend | ✅ Yes (header) |
| 3 | Symptom/Issue | Short description of what’s wrong | Deployment failing / High 5xx error rate | ✅ Yes (header) |
| 4 | Environment/Scope | Where it happens (cluster, env, region) | prod-eu-central-1 / Robert cluster | ✅ Yes |
| 5 | Impact | Human-readable effect (user/system impact) | User login failing / API degraded | ✅ Yes |
| 6 | Owner/Team | Who should take action | Team Backend (on-call) | ✅ Yes |
| 7 | Next Step / Runbook | Quick guidance or link to runbook | Check DB connections: [link] | ✅ Yes |
| 8 | Dashboard Link | Direct link to metrics panel | Grafana/CloudWatch/DataDog URL | ✅ Yes |
| 9 | Logs / Traces Link | Shortcut to logs or error traces | Loki/Kibana/Sentry | ✅ Yes (if available) |
| 10 | Chart (snapshot) | Inline chart (if source supports) | CW chart / Grafana panel snapshot | ✅ Optional but valuable |
| 11 | Source / Silence / Ack Links | Original tool link, silence, acknowledge | Grafana silence URL / OpsGenie ack | ✅ Yes |
| 12 | Original Message | Raw alert for debugging | Raw Grafana/CloudWatch text | ✅ Yes (collapsed or after summary) |
| 13 | Metadata / Tags | Extra labels for filtering/search | namespace=agentic-workflow, priority=P2 | ❌ Not in main alert (keep hidden/metadata) |
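
If the alert body is assembled in code, the table above can be encoded directly as an ordered list of fields plus an inclusion rule. A minimal sketch in Python (the field names and the `ALERT_FIELD_ORDER` constant are illustrative, not an existing API):

```python
# (field, inclusion) pairs in the order defined by the table above.
#   "required" = always render in the main alert
#   "optional" = render only when the source provides it (logs link, chart)
#   "hidden"   = keep as metadata for filtering/search, never render
ALERT_FIELD_ORDER = [
    ("severity", "required"),
    ("component", "required"),
    ("symptom", "required"),
    ("environment", "required"),
    ("impact", "required"),
    ("owner", "required"),
    ("next_step", "required"),
    ("dashboard_link", "required"),
    ("logs_link", "optional"),
    ("chart", "optional"),
    ("source_silence_ack_links", "required"),
    ("original_message", "required"),  # collapsed or placed after the summary
    ("metadata_tags", "hidden"),
]
```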

🔹 Updated Format (with human-friendly Levels)
{PrioIcon} P{prio}: {symptom} ({levels}) on {component} / {resource} / {project}

📅 Date: {timestamp in UTC}  
📊 State: {FIRING | RESOLVED}  
👤 Owner: {team/person responsible}  
🌍 Location: {provider}:{account}:{env}  

📝 Description: 
- Start with what was detected (metric + threshold breached).  
- Add how it was evaluated (X of Y checks, duration if relevant).  
- Provide system/service context (which component/resource is impacted, 
  what part of infra or app is showing the problem).  
- Example: "Grafana detected that 5xx error rate on lineup-service endpoint:/lineup 
  exceeded threshold (1.2 vs 1.0 errors/sec, 3 of 5 checks over 5m). 
  This indicates instability in the backend or ingress."
  
⚠️ Impact: 
- Translate technical failure → user/business effect.  
- Include severity of effect (how many users, which functions, % of requests, 
  or if it’s only infra-level but no visible user effect).  
- Be concrete, not generic.  
- Example: "~10% of user requests to lineup-service are failing with 5xx errors, 
  causing partial disruption in production. Login and checkout flows are unaffected."
  
🔧 Next step: {clear first action or runbook link}  
🔗 Links: [Dashboard] [Logs] [Silence/Ack]  

📡 Source: {tool/system name}  

--- Original ---
{raw alert message}
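
A minimal sketch of how this block could be rendered programmatically, assuming a simple `Alert` container holding the fields listed above (the dataclass, the field names, and the icon mapping are hypothetical, not an existing API):

```python
from dataclasses import dataclass

# Assumed priority→icon mapping, matching the examples further below.
PRIO_ICONS = {1: "🔴", 2: "🟠", 3: "🟡", 4: "🔵"}

@dataclass
class Alert:
    prio: int
    symptom: str      # includes the human-friendly level, e.g. "Error rate too high (1.2 of 1.0 errors/sec)"
    component: str
    resource: str
    project: str
    date_utc: str
    state: str        # "FIRING" | "RESOLVED"
    owner: str
    location: str     # "{provider}:{account}:{env}"
    description: str
    impact: str
    next_step: str
    links: str        # e.g. "[Dashboard] [Logs] [Silence/Ack]"
    source: str
    original: str

def render_alert(a: Alert) -> str:
    """Render the standardized alert body in the format defined above."""
    return (
        f"{PRIO_ICONS[a.prio]} P{a.prio}: {a.symptom} on {a.component} / {a.resource} / {a.project}\n"
        f"\n"
        f"📅 Date: {a.date_utc}\n"
        f"📊 State: {a.state}\n"
        f"👤 Owner: {a.owner}\n"
        f"🌍 Location: {a.location}\n"
        f"\n"
        f"📝 Description:\n{a.description}\n"
        f"\n"
        f"⚠️ Impact:\n{a.impact}\n"
        f"\n"
        f"🔧 Next step: {a.next_step}\n"
        f"🔗 Links: {a.links}\n"
        f"\n"
        f"📡 Source: {a.source}\n"
        f"\n"
        f"--- Original ---\n{a.original}"
    )
```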
Prompt
You are helping me generate a standardized alert message for on-call use.  
Use the following template and structure exactly:  

{PrioIcon} P{Prio}: {Issue phrase + (Levels)} on {Component} / {Resource} / {Project}

📅 Date: {timestamp in UTC}  
📊 State: {FIRING | RESOLVED}  
👤 Owner: {team/person responsible}  
🌍 Location: {provider}:{account}:{env}  

📝 Description: 
{Write 2-4 sentences. Start with what was detected (metric + threshold breached). 
Include evaluation info (X of Y checks, duration). 
Add system/service context (which component/resource is showing the problem).}

⚠️ Impact: 
{Write 1-3 sentences. Translate technical failure → user/business effect. 
Be specific: % of requests failing, which endpoints affected, how many replicas down, 
or state clearly if it’s infra-only with no direct user effect.}

🔧 Next step: {first troubleshooting action or runbook link}  
🔗 Links: [Dashboard] [Logs] [Silence/Ack]  

📡 Source: {tool/system name}  

--- Original ---
{raw alert message}

Now, here are the details of the case:  
- Priority: P2  
- Component: lineup-service  
- Resource: endpoint:/lineup  
- Project/Deployment: spm-ingress-alerts  
- Metric: 5xx error rate  
- Threshold: 1.0 errors/sec  
- Current: 1.2 errors/sec  
- Evaluation: 3 of 5 checks, over 5m  
- Impact: ~10% of user requests failing with 5xx errors in production  
- Owner: Das Meta On-Call Team  
- Source: Grafana via OpsGenie  
- Links: [Dashboard], [Silence]  
- Original: [FIRING:1] High 5xx Error Rate: Lineup Endpoint ...
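
Since the prompt is itself a template, it can be filled in programmatically before being sent to whichever LLM is in use. A minimal sketch using plain string formatting (the dictionary keys and the `build_prompt` helper are illustrative; the case values are the ones listed above, and Example 5 below shows the expected output for this case):

```python
CASE = {
    "Priority": "P2",
    "Component": "lineup-service",
    "Resource": "endpoint:/lineup",
    "Project/Deployment": "spm-ingress-alerts",
    "Metric": "5xx error rate",
    "Threshold": "1.0 errors/sec",
    "Current": "1.2 errors/sec",
    "Evaluation": "3 of 5 checks, over 5m",
    "Impact": "~10% of user requests failing with 5xx errors in production",
    "Owner": "Das Meta On-Call Team",
    "Source": "Grafana via OpsGenie",
    "Links": "[Dashboard], [Silence]",
    "Original": "[FIRING:1] High 5xx Error Rate: Lineup Endpoint ...",
}

PROMPT_HEADER = (
    "You are helping me generate a standardized alert message for on-call use.\n"
    "Use the template and structure defined above exactly.\n"
)

def build_prompt(case: dict) -> str:
    """Assemble the prompt for one alert case; sending it to the LLM is left to the caller."""
    details = "\n".join(f"- {key}: {value}" for key, value in case.items())
    return f"{PROMPT_HEADER}\nNow, here are the details of the case:\n{details}\n"

# print(build_prompt(CASE))
```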
Examples
🔹 Example 1 – Grafana (Deployment failing)
🟡 P3: Replica count too low (0 of 3) on agentic-workflow-backend / deployment / agentic-workflow

📅 Date: 2025-08-26 14:32 UTC  
📊 State: FIRING  
👤 Owner: Backend Team  
🌍 Location: AWS:kauz-prod:Robert  

📝 Description: 
Grafana detected that the deployment agentic-workflow-backend 
has insufficient replicas running in Robert cluster.  
Condition was breached: 0 replicas available vs 3 expected, 
sustained for more than 5m.  

⚠️ Impact: 
The agentic-workflow-backend is not serving traffic, causing request 
failures in the agentic-workflow project. Users depending on this 
backend may face service unavailability.  

🔧 Next step: Check rollout with `kubectl describe deployment`  
🔗 Links: [Dashboard] [Silence]  

📡 Source: Grafana  

--- Original ---
[FIRING:1] Deployment Failing Alert ...
🔹 Example 2 – AWS CloudWatch (ALB request spike)
🟠 P2: Request volume too high (52k of 50k req/min) on alb-external-prod / loadbalancer / prod

📅 Date: 2025-08-26 13:15 UTC  
📊 State: FIRING  
👤 Owner: Infra Team  
🌍 Location: AWS:970404667933:prod  

📝 Description: 
CloudWatch detected a sudden surge in request count on the production ALB.  
Condition was breached: 52,000 requests/min vs 50,000 allowed, 
observed in 4 of 5 checks over 10m.  

⚠️ Impact: 
The load balancer is near or above designed capacity.  
If backend scaling does not keep up, users may experience 
slower responses or request failures.  

🔧 Next step: Validate backend auto-scaling and check error rates  
🔗 Links: [CloudWatch Chart]  

📡 Source: AWS CloudWatch  

--- Original ---
❌ ALARM: "Performance Threshold Alert for Prod ALB (excl_5xx_4xx)" ...
🔹 Example 3 – Updown.io (SSL renewal)
🔵 P4: SSL certificate renewed successfully (90 of 90 days) on managed-kauz.net / ssl-certificate / prod

📅 Date: 2025-08-25 13:02 UTC  
📊 State: RESOLVED  
👤 Owner: Infra Team  
🌍 Location: AWS:kauz-prod:prod  

📝 Description: 
Updown.io detected that the SSL certificate for managed-kauz.net 
was renewed.  
Condition checked: new cert validity 90 of 90 days, 
old cert replaced on all endpoints.  

⚠️ Impact: 
No user disruption. Secure connections remain valid.  
If duplicate notifications appear, some servers may still serve the old cert.  

🔧 Next step: Verify all nodes present the new certificate  
🔗 Links: [Updown Panel]  

📡 Source: Updown.io  

--- Original ---
SSL certificate renewed for wasserhygiene.managed-kauz.net ...
🔹 Example 4 – Betterstack (Sandbox outage)
🔴 P1: Service failing with 504s (5 of 5 checks) on sandbox-barcode / url-check / sandbox-services

📅 Date: 2025-08-26 09:47 UTC  
📊 State: FIRING  
👤 Owner: Sandbox Team  
🌍 Location: AWS:sandbox-services:sandbox  

📝 Description: 
Betterstack detected repeated 504 Gateway Timeout responses 
from sandbox-barcode.  
Condition observed: 5 of 5 health checks returned 504 errors 
over 5m.  

⚠️ Impact: 
Sandbox BarCode service is fully unavailable.  
All users accessing this endpoint are affected.  

🔧 Next step: Check backend availability and restart the service if required  
🔗 Links: [Acknowledge]  

📡 Source: Betterstack  

--- Original ---
Monitor: Sandbox - BarCode | Cause: Status 504 ...
🔹 Example 5 – Grafana → OpsGenie (5xx error rate)
🟠 P2: Error rate too high (1.2 of 1.0 errors/sec) on lineup-service / endpoint:/lineup / spm-ingress-alerts

📅 Date: 2025-08-26 12:58 UTC  
📊 State: FIRING  
👤 Owner: Das Meta On-Call Team  
🌍 Location: AWS:spm-ingress-alerts:prod  

📝 Description: 
Grafana alert detected that the 5xx error rate on lineup-service 
endpoint:/lineup exceeded the defined threshold.  
Condition: 1.2 errors/sec vs 1.0 allowed, 
3 of 5 checks over 5m.  

⚠️ Impact: 
~10% of user requests to lineup-service are failing, 
causing partial production disruption.  

🔧 Next step: Check ingress logs and service backend for failures  
🔗 Links: [Dashboard] [Silence]  

📡 Source: Grafana via OpsGenie  

--- Original ---
[FIRING:1] High 5xx Error Rate: Lineup Endpoint ...
🔹 Example 6 – Sentry (PHP TypeError)
🟠 P2: Unexpected TypeErrors detected (62 of 0 expected) on yii2-website / controller:DefaultController.php / web-prod

📅 Date: 2025-08-26 03:27 UTC  
📊 State: FIRING  
👤 Owner: Suggested Assignee → Philipp R.  
🌍 Location: AWS:web-prod:prod  

📝 Description: 
Sentry captured multiple TypeError exceptions in yii2-website.  
Condition: 62 events observed vs 0 expected in the last 10m.  
Code attempted to assign a string to MembershipTeamForm::$interval (int expected).  

⚠️ Impact: 
Pricing updates may fail for all users of the site.  
Errors are ongoing and could affect subscription renewals 
and membership management.  

🔧 Next step: Fix type mismatch in MembershipTeamForm::$interval  
🔗 Links: [Sentry Issue] [Repo]  

📡 Source: Sentry  

--- Original ---
TypeError /var/www/html/common/modules/premium/src/controllers/DefaultController.php ...