11. Common Pitfalls (And How to Avoid Them)
2024-09-01
Every migration comes with lessons — and many of them don’t show up in slide decks. Here are the most common pitfalls we’ve seen across 20+ production migrations, grouped into patterns you can look out for.
- Liveness and readiness probes kill or deregister pods that aren't ready yet
- Containers scale up and down without notice
- IPs and instance IDs change frequently
- Graceful shutdown is suddenly mandatory
We’ve seen apps fail at deployment because readiness probes exposed initialization bottlenecks, and background jobs break in subtle ways once multiple replicas were running side by side.
Fix: Treat your app like it’s running in a hostile, dynamic environment — because it is. Simulate failures. Restart containers. Kill them mid-process. Watch what breaks.
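To make that last bullet concrete, here is a minimal graceful-shutdown sketch in Go, assuming a plain net/http service; the port, endpoint, and 20-second drain window are illustrative choices, not values from any specific migration.

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // simple endpoint for liveness/readiness probes
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}

	// Run the server in the background so main can wait for a shutdown signal.
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("server error: %v", err)
		}
	}()

	// Block until the orchestrator sends SIGTERM (or Ctrl-C when running locally).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	// Give in-flight requests a bounded window to finish before exiting.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```

The orchestrator sends SIGTERM before it kills the container, so catching it and draining in-flight requests is often the difference between a clean rollout and a spike of dropped connections.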
Teams often migrate without rebuilding observability for the new environment, relying on old metrics and dashboards that don't reflect the new reality.
- No real load testing under production-like traffic
- No synthetic transactions to validate cutover
- Missing key metrics (queue lag, request errors, job failures)
- Lack of log correlation between old and new systems
Fix: Build visibility first. Even partial dashboards are better than none. Compare key baselines before the switch, not after.
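To make the synthetic-transaction point concrete, the sketch below (Go, with hypothetical URLs and a 30-second interval; nothing here comes from a real system) exercises one user-visible endpoint on both the old and new stacks and logs status and latency so the two can be compared during cutover.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

// probe issues one synthetic request and reports status code and latency.
func probe(label, url string) {
	start := time.Now()
	resp, err := http.Get(url)
	if err != nil {
		log.Printf("%s: request failed: %v", label, err)
		return
	}
	defer resp.Body.Close()
	log.Printf("%s: status=%d latency=%s", label, resp.StatusCode, time.Since(start))
}

func main() {
	// Hypothetical endpoints: replace with a real user-visible transaction on each stack.
	targets := map[string]string{
		"legacy": "https://legacy.example.com/api/checkout/health",
		"new":    "https://new.example.com/api/checkout/health",
	}

	for range time.Tick(30 * time.Second) {
		for label, url := range targets {
			probe(label, url)
		}
	}
}
```

Even something this small gives you a comparable baseline on both sides before any traffic is switched.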
- Slight configuration differences (timeouts, isolation levels) create cascading failures
- Live replication setups don’t match production latency or behavior
- Teams assume rollback is possible when it isn’t (especially once writes have started)
- Background workers and retries behave differently post-move
Fix: Treat the data migration as non-reversible and the data itself as the critical asset it is. Double-check everything. Add visibility into job behavior. Test under failure modes.
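Part of "double-check everything" can be automated. The sketch below is one way to run a cheap first-pass consistency check between the old and new databases; it assumes both sides are Postgres, and the connection strings, table names, and lib/pq driver choice are all placeholders for whatever your stack actually uses.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // hypothetical driver choice; swap in whatever matches your databases
)

// rowCount returns the number of rows in a table: a cheap first-pass consistency check.
func rowCount(db *sql.DB, table string) (int64, error) {
	var n int64
	err := db.QueryRow(fmt.Sprintf("SELECT COUNT(*) FROM %s", table)).Scan(&n)
	return n, err
}

func main() {
	// Placeholder connection strings, not real endpoints.
	oldDB, err := sql.Open("postgres", "postgres://user:pass@old-db:5432/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	newDB, err := sql.Open("postgres", "postgres://user:pass@new-db:5432/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}

	for _, table := range []string{"orders", "payments", "jobs"} { // illustrative table names
		oldN, err := rowCount(oldDB, table)
		if err != nil {
			log.Fatalf("old %s: %v", table, err)
		}
		newN, err := rowCount(newDB, table)
		if err != nil {
			log.Fatalf("new %s: %v", table, err)
		}
		status := "OK"
		if oldN != newN {
			status = "MISMATCH"
		}
		log.Printf("%-10s old=%d new=%d %s", table, oldN, newN, status)
	}
}
```

Row counts are only a smoke test; for anything critical you would follow up with per-range checksums or application-level comparisons.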
Just because EC2 is familiar doesn’t mean it's the right default. And just because you can lift something doesn't mean you should.
- EC2 for services better suited to managed offerings (like RDS or EKS)
- Using gp3 where io1 is required, or vice versa
- Underprovisioned instance types that degrade silently
- Overcomplicated network configs that slow down deployments
- Broad IAM roles exposing you to lateral movement risk
Fix: Match the service to the workload, not to what your team used before. Reassess decisions as load, complexity, or team size changes.
It’s tempting to redo CI/CD, switch observability stacks, and rebuild backups while you migrate. That path rarely works.
- Teams lose time debating non-blocking issues
- Focus shifts from stability to elegance too early
- Infrastructure becomes a sandbox instead of a platform
Fix: Timebox improvements. Defer "nice to have" upgrades until core systems are running. Migration is a phase, not a blank check for reinvention.
Migrations fail not from single points of failure, but from accumulated friction. Every skipped probe config, missing metric, or partial test becomes a blocker under load.
The best defense is realism: simulate load, kill things randomly, and don’t assume clean cutovers.
And above all, make sure your monitoring isn’t just watching the system — it’s watching the transition.