Cloud Architecture

🚀 New Service Setup Checklist & Form

Developers fill out this checklist when setting up a new service. Provide every required detail in the Value column and confirm that each setup step is complete.

📦 Service Basics

Item	Priority	Description	Example	Value
Service name	High	Unique name for the service	`payment-service`
Source repository URL	High	Git repository where the service code is hosted	`https://github.com/org/payment-service`
Start/build command	High	Command to build or start the service	`npm start`, `java -jar app.jar`
Environment variables required	High	List all ENV keys needed by the service	`DB_HOST, API_KEY, LOG_LEVEL`
Secrets required	High	Secrets to be stored securely in Vault/Secrets Manager	`DB_PASSWORD`, `JWT_SECRET`
Expected service ports	Medium	Ports the service listens on	`8080`
Owner/team	High	Team or person responsible for the service	`Payments Team`, `devops@company.com`
Documentation link	Medium	Link to service documentation in Confluence or GitHub Wiki	`https://confluence.org/payment-service`
Feature flags/toggles included	Medium	Are feature flags needed?	`enable-beta-feature`	✅ / ❌
Default configs documented	Medium	Base configs per environment documented	`config/staging.yaml`	✅ / ❌
Graceful shutdown	High	Service finishes in-flight work before exiting on termination (open question: does Linkerd cover this?)		✅ / ❌
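
For the graceful shutdown row: Kubernetes sends SIGTERM to a container when its pod is terminated, so the process usually needs to handle that signal itself, whatever the service mesh does for connections. The following is a minimal sketch, assuming a simple Python worker loop. The names `request_shutdown`, `install_handlers`, `serve`, and `handle_one` are hypothetical and not taken from any existing service.

```python
import signal
import threading

# Set once a termination signal arrives; the work loop checks it between items.
shutdown_requested = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: only record that shutdown was requested."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler for SIGTERM and SIGINT (call from the main thread)."""
    signal.signal(signal.SIGTERM, request_shutdown)
    signal.signal(signal.SIGINT, request_shutdown)


def serve(handle_one, poll_seconds=0.1):
    """Run handle_one() until shutdown is requested.

    handle_one (hypothetical) processes one unit of work and returns True,
    or returns False when there is nothing to do. The item in progress is
    always finished before the loop exits. Returns the number processed.
    """
    processed = 0
    while not shutdown_requested.is_set():
        if handle_one():
            processed += 1
        else:
            # Idle: wait briefly, but wake immediately on shutdown.
            shutdown_requested.wait(poll_seconds)
    return processed
```

A service would call `install_handlers()` once at startup and then `serve(...)`. The pod's `terminationGracePeriodSeconds` needs to be long enough for the longest unit of work to finish.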

☁️ Cloud & Deployment

Item	Priority	Description	Example	Value
Dockerfile exists	High	Confirm a working Dockerfile exists and is optimized	`FROM alpine:3.18`	✅ / ❌
K8s health checks implemented	High	Startup, readiness, and liveness probes configured	`/healthz` returns HTTP 200	✅ / ❌
Resource requests & limits	High	CPU/memory settings for k8s pods	`requests: cpu:500m, memory:256Mi`
Ingress required?	High	Does the service need external access?	`Yes`	Yes / No
Ingress hostname/path	High	Hostname and path for ingress	`api.company.com/payment`
Deployment strategy	Medium	Deployment rollout type	`RollingUpdate`, `Canary`
Autoscaling configured	Medium	HPA/VPA set up for scaling pods	`HPA: min=2, max=5 pods`	Yes / No
Autoscaling signal matches application behaviour	Medium	Scaling metric chosen to fit the workload: CPU/memory based or queue-content based
ServiceAccount / RBAC configured	High	ServiceAccount and RBAC with least privilege	`payment-service-sa`	✅ / ❌
Pod disruption budgets	Medium	Ensures minimal service downtime during node upgrades	`minAvailable: 1`	✅ / ❌
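
For the health check row: the sketch below shows one way a service could expose probe endpoints using only the Python standard library. The `/healthz` path and port `8080` come from the examples above. The `/readyz` path, the `READY` flag, and the function names are assumptions made for illustration.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical readiness flag: flip to True once dependencies are connected.
READY = {"value": False}


def probe_response(path, ready):
    """Map a probe path to an (HTTP status, JSON payload) pair."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer.
        return 200, {"status": "ok"}
    if path == "/readyz":
        # Readiness: only accept traffic once startup work is done.
        if ready:
            return 200, {"status": "ready"}
        return 503, {"status": "not ready"}
    return 404, {"error": "not found"}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, payload = probe_response(self.path, READY["value"])
        body = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep probe traffic out of the access log.
        pass


def run(port=8080):
    """Serve probes until the process is stopped."""
    HTTPServer(("0.0.0.0", port), ProbeHandler).serve_forever()
```

Keeping liveness and readiness on separate paths lets Kubernetes stop routing traffic to a pod that is alive but not yet ready, without restarting it.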

📊 Observability

Item	Priority	Description	Example	Value
APM integration required?	High	Should the service have tracing/profiling?	`Tempo tracing enabled`	Yes / No
Prometheus metrics exposed	High	Exposes a `/metrics` endpoint with runtime (engine) and app-specific metrics	`http_requests_total`, `queue_depth`	✅ / ❌
Key KPIs to monitor	High	Define KPIs for health and performance	`Error rate <1%, latency p95 <500ms`
Dashboards required?	Medium	Should Grafana dashboards be created?	`Grafana: payment-service-dashboard`	✅ / ❌
Alerts configured	High	Alerting for key metrics in place	`500 errors >10/min triggers PagerDuty`	✅ / ❌
Log format standardized	Medium	JSON logs with correlation IDs	`{"request_id":"abc-123", "message":"OK"}`	✅ / ❌
External synthetic health checks	Medium	Uptime monitoring from user perspective	`Pingdom health check enabled`	✅ / ❌
Audit logs implemented	Medium	Logs security-sensitive actions	`User X deleted resource Y`	✅ / ❌
Centralized logging setup	High	Logs shipped to centralized system (Loki, CloudWatch, ELK)	`JSON logs to CloudWatch`	✅ / ❌
Metric baselines documented	Medium	Define normal ranges for key metrics	`Latency p95 < 300ms in normal load`	✅ / ❌
Sentry	High	Sentry is instrumented for error and performance monitoring (APM)		✅ / ❌
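
For the log format row: the sketch below shows one way to emit JSON logs that carry a correlation ID, using only the Python standard library. The field names `request_id` and `message` follow the example above. `JsonFormatter`, `build_logger`, and the extra `level` and `logger` fields are assumptions made for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation ID passed via `extra=`; None if the caller omitted it.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


def build_logger(name, stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False  # keep the root logger from re-emitting the line
    return logger
```

A request handler would then log with `log.info("OK", extra={"request_id": "abc-123"})`, so every line for one request can be found by its ID in the centralized logging system.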

🔒 Security

Item	Priority	Description	Example	Value
Secrets stored securely	High	All secrets stored in Secrets Manager/Vault	`AWS Secrets Manager`	✅ / ❌
TLS/HTTPS enforced	High	HTTPS configured for all external endpoints	`cert-manager auto-renewal enabled`	✅ / ❌
API authentication method	High	Auth mechanism for API access	`OAuth2`, `JWT`, `API keys`
Vulnerability scanning enabled	High	Vulnerability scanning integrated in pipeline	`Trivy scan as GitLab CI stage`	✅ / ❌
Dependency scanning configured	Medium	Dependency scanning for known CVEs	`Dependabot alerts enabled`	✅ / ❌
RBAC for API endpoints	High	Authorization implemented	`admin role can delete users`	✅ / ❌
Image signing & verification	Medium	Container images are signed and verified	`cosign signed images`	✅ / ❌
Rate limiting implemented	Medium	Protect APIs from abuse and DoS attacks	`10 req/s per IP`	✅ / ❌
Data encryption configured	High	Data encrypted at rest and in transit	`RDS encryption enabled`	✅ / ❌
Penetration testing planned	Medium	Security testing included for critical endpoints	`Scheduled Q4`	✅ / ❌
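
For the rate limiting row: a limit such as `10 req/s per IP` is often implemented with a token bucket. The sketch below is an in-memory, single-process illustration only. The class names are hypothetical, and a real deployment would more likely enforce the limit at the ingress or API gateway, or keep the counters in shared storage.

```python
import time


class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request costs one."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so a short burst is allowed
        self.updated = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Add the tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientLimiter:
    """Keep one bucket per client IP (never evicted in this sketch)."""

    def __init__(self, rate=10, capacity=10, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, client_ip):
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[client_ip] = bucket
        return bucket.allow()
```

A handler would call `limiter.allow(client_ip)` per request and answer with HTTP 429 when it returns `False`. The injectable `clock` exists so the behaviour can be tested without sleeping.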

🏁 Handoff & Operations

Item	Priority	Description	Example	Value
Ownership assigned	High	Service owner/contact documented	`devops@company.com`, `#payments-team`
Client team trained	Medium	Training provided to client team	`1h walkthrough recorded in Confluence`	✅ / ❌
Documentation updated	High	Docs available in Confluence/CloudBrowser	`Confluence page created: Service Overview`	✅ / ❌
Runbook created	High	Operational runbook for on-call teams	`payment-service-runbook.md`	✅ / ❌
Monitoring alerts tested	High	Simulated alerts to ensure delivery to on-call	`PagerDuty test succeeded`	✅ / ❌
Cost tagging configured	Medium	Tags applied for cost allocation	`Team:Payments, Env:Prod`	✅ / ❌
Backup procedures documented	High	Backup strategy for DBs/configs documented	`Daily RDS snapshots configured`	✅ / ❌
Support escalation path defined	Medium	Documented escalation contacts	`L1: DevOps, L2: SRE Team`	✅ / ❌
Post-deployment validation	High	Service tested and validated post-deploy	`Smoke tests passed`	✅ / ❌
Old unused configs removed	Low	Clean up unused configs/artifacts	`Removed test values from ConfigMap`	✅ / ❌
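
For the post-deployment validation row: the smoke tests could be as simple as requesting a few known paths and comparing status codes. The sketch below assumes the checks are plain GET requests. `fetch_status` and `run_smoke_tests` are hypothetical names, and the base URL and paths in the usage note reuse examples from this checklist.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for a GET to `url`, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth reporting.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


def run_smoke_tests(base_url, checks, fetch=fetch_status):
    """Check each (path, expected_status) pair against the deployed service.

    Returns a list of (path, expected, actual) tuples for every mismatch;
    an empty list means all checks passed.
    """
    failures = []
    for path, expected in checks:
        actual = fetch(base_url.rstrip("/") + path)
        if actual != expected:
            failures.append((path, expected, actual))
    return failures
```

For example, `run_smoke_tests("https://api.company.com/payment", [("/healthz", 200)])` returns an empty list when the deployed service answers as expected. A pipeline step can fail the deploy whenever the list is non-empty.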

…