AWS Well-Architected: what I trust, what I ignore, what breaks anyway

The story I’m trying to tell (why frameworks help)

I like frameworks because they turn vague opinions into concrete questions. That matters when the system grows. It matters even more when the team changes.

The Twelve-Factor App did this for application design. It didn’t promise perfection. It forced discipline. It gave teams a shared language.

In India, there’s the story of Samudra Manthan. The ocean is churned and fourteen treasures emerge. One of them is Dhanvantari rising with Amrita, the elixir of immortality. Read the story.

That’s how I think about the AWS Well-Architected Framework. It’s not magic. It’s years of scars turned into questions.

The six pillars are a knife fight

The framework has six pillars. They aren’t independent.

When you push one, another one moves. Usually in the wrong direction.

Reliability pulls cost up. Security pulls velocity down. Performance pulls reliability down if you’re sloppy with caches and timeouts. Operational Excellence pulls everything toward sanity, but it takes time.

If you’re looking for “best practices,” you’re going to be disappointed.

How I run a Well-Architected review (without turning it into paperwork)

I don’t start with the questionnaire. I start with scope.

I want to know what’s in the workload, what users care about, and what would ruin someone’s weekend.

Then I use the AWS Well-Architected Tool to force two outcomes:

First, a short list of high/medium risks we’ll actually fix. Second, explicit risks we’re choosing to live with.

If the output isn’t a backlog, the review wasn’t worth the meeting.

Operational Excellence: boring deploys beat heroic debugging

The most expensive cloud bill I’ve seen wasn’t compute. It was people time. It was the team babysitting deploys because every release was a coin flip.

What breaks in production is usually basic:

You can’t tell what changed. You can’t roll back cleanly. Alerts page humans for noise. Dashboards don’t match the user experience. Runbooks don’t exist, or they’re fantasy.

Here’s the trade-off nobody says out loud. You can buy reliability with redundancy. You can’t buy it if you ship chaos.

If I’m looking at an AWS/EKS workload, I’m usually looking for a few concrete signals.

Is there a single source of truth for infra changes, like Terraform, and do people trust it. Or does every emergency fix happen in the console and get “captured later.”

Are deployments observable. If you use GitHub Actions, CodePipeline, or whatever, I want to see a deployment marker in CloudWatch dashboards and an easy path to rollback.

Are we measuring user pain or just machine stats. CPU graphs don’t tell you when checkout is timing out. Latency and error rate do. I’d rather have one honest SLO and an alert that fires twice a month than fifty alerts that fire every day.
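One honest SLO implies an error budget you can actually arithmetic against. Here’s a minimal sketch, assuming availability is measured as non-error requests over total requests; the 99.9% target, the numbers, and the function name are mine, not from any AWS tool:

```python
# Sketch: evaluate one availability SLO against its monthly error budget.
# Target and traffic numbers are illustrative.

def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float = 0.999) -> float:
    """Fraction of the error budget still unspent (negative means SLO is blown)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0
    return 1 - (failed_requests / allowed_failures)

# 1M requests this month at a 99.9% target allows 1,000 failures.
# 400 failures so far leaves 60% of the budget.
remaining = error_budget_remaining(1_000_000, 400)
print(f"{remaining:.0%} of the error budget left")
```

Alerting on budget burn rate, rather than on raw CPU, is exactly the “user pain, not machine stats” point: the alert fires when users are actually losing requests.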

Security: least privilege isn’t the goal, least blast radius is

I’ve seen too many teams obsess over a perfect IAM policy while they still run everything in one AWS account with admin roles floating around.

Least privilege is good. Least blast radius is better.

If one compromised workload can touch every environment, every database, and every CI secret, you don’t have “security posture.” You have luck.

In EKS, one of the most common self-inflicted wounds is giving pods more AWS permissions than they need.

If you’re still using the node instance role as the “app role,” you’re making lateral movement easy. Use IRSA (IAM Roles for Service Accounts), which authenticates pods through the cluster’s OIDC provider. Give each Kubernetes service account a role that’s small and boring.
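What makes an IRSA role “small and boring” is mostly its trust policy: it should only be assumable by one service account in one namespace. A sketch of that trust policy, built in Python so the shape is easy to see; the account ID, OIDC provider ID, namespace, and service account name are all placeholders:

```python
import json

# Sketch of an IRSA trust policy scoped to a single Kubernetes service
# account. The OIDC provider ID and account ID below are placeholders.
OIDC = "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"

def irsa_trust_policy(account_id: str, namespace: str, service_account: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Federated": f"arn:aws:iam::{account_id}:oidc-provider/{OIDC}"},
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Pin the role to exactly one namespace + service account.
                f"{OIDC}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{OIDC}:aud": "sts.amazonaws.com",
            }},
        }],
    }

print(json.dumps(irsa_trust_policy("111122223333", "payments", "checkout"), indent=2))
```

If the `sub` condition is missing or wildcarded, any pod in the cluster can assume the role, and you’re back to the node-role problem with extra steps.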

Centralized logging isn’t a checkbox either. I want org-level CloudTrail, logs in a dedicated security account, and a clear retention policy. I want AWS Config in critical accounts so “who changed what” isn’t a mystery.

GuardDuty and Security Hub can help, but only if someone responds. If nobody owns findings, you’re paying for reassurance.

The early-stage shortcut I’ll still defend is picking a few guardrails you can enforce everywhere.

Use AWS Organizations. Use SCPs to block the truly dumb stuff. Keep humans on SSO. Keep break-glass access rare, audited, and painful.

Reliability: multi-AZ is table stakes, but recovery is the product

People say “we need four nines” like it’s a setting.

It’s not. It’s an operating model.

Multi-AZ by itself won’t save you if your failure mode is your own deployment, your database migrations, or your dependency chain. Most outages are self-inflicted.

The uncomfortable bit. If you can’t restore data fast, you don’t have reliability. You have a hope-and-pray strategy with backups.

The most useful reliability question isn’t “are we multi-AZ.”

It’s “what’s our plan when this dependency fails.”

If your database is RDS, do you have Multi-AZ and automatic backups with a tested restore. If you’re on DynamoDB, do you have PITR (point-in-time recovery) on and do you know how to recover after a bad write.

If you’re using SQS, do you have a dead-letter queue and alarms on it. If you’re using EventBridge, do you have retries and a replay story. If you’re using ALB, do you have sane target group health checks and timeouts.
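The SQS check is mechanical enough to automate. A sketch that inspects queue attributes (the shape `get_queue_attributes` returns) and flags queues with no dead-letter queue or an absurd `maxReceiveCount`; the queue names and the ceiling of 10 are mine:

```python
import json

# Sketch: given SQS queue attributes, confirm a dead-letter queue is wired
# up with a sane maxReceiveCount. The attribute shape is real; values are
# made up for illustration.

def has_sane_dlq(attributes: dict, max_receive_ceiling: int = 10) -> bool:
    raw = attributes.get("RedrivePolicy")
    if raw is None:
        return False  # no DLQ configured at all
    policy = json.loads(raw)
    return ("deadLetterTargetArn" in policy
            and 1 <= int(policy["maxReceiveCount"]) <= max_receive_ceiling)

attrs = {"RedrivePolicy": json.dumps({
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:111122223333:orders-dlq",
    "maxReceiveCount": 5,
})}
print(has_sane_dlq(attrs))  # True
print(has_sane_dlq({}))     # False: messages silently vanish on poison pills
```

A DLQ without an alarm on its depth is still a hope-and-pray strategy; the queue fills up quietly and nobody looks.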

Now the ugly trade-off. Reliability improvements often look like “more of everything.”

More clusters. More regions. More replication. More automation.

That can be correct. It can also be a trap. If the team can’t operate one cluster cleanly, a second cluster just doubles the outage surface.

Performance Efficiency: the fastest way to melt prod is “works on my laptop”

Performance issues rarely start as performance issues. They start as missing limits, missing timeouts, and missing budgets.

In EKS, “performance” is also about noisy neighbors. It’s about resource requests that don’t match reality. It’s about autoscaling decisions based on the wrong signal.

If your scaling policy is CPU-only, but your bottleneck is I/O or lock contention, you’re just adding cost and getting slower.

Performance work gets silly when it’s not tied to a workload.

If you’re running APIs, I care about tail latency. p95 and p99. I care about cold starts if you’re serverless. I care about connection churn if you’re hitting RDS.

In AWS, the most common performance fix that also saves money is getting resource selection right.

Graviton can be a win, but only if your build pipeline produces multi-arch images and your sidecars don’t fall apart on arm64. This is where “best practice” becomes engineering work.

Caching is another one. CloudFront in front of the right things can make your backend look magically stable. It can also hide bugs and make invalidation mistakes painful. I’ve seen teams pay for performance by shipping the wrong cache headers.

Be specific. Timeouts. Retries. Connection pooling. Backpressure. Those are performance features, not afterthoughts.
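Those features have to be designed together. A retry loop without a total deadline is a performance bug waiting for an outage to reveal it. A minimal sketch of retries with exponential backoff, full jitter, and an overall budget; the defaults are illustrative, not a recommendation:

```python
import random
import time

# Sketch: retries with exponential backoff, full jitter, and a total
# deadline. The point is that timeouts and retries are one design, not two.

def call_with_retries(fn, attempts=4, base_delay=0.1, deadline_s=2.0):
    start = time.monotonic()
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the real error
            # Full jitter: sleep a random slice of the exponential window,
            # so a thundering herd of clients doesn't retry in lockstep.
            delay = random.uniform(0, base_delay * (2 ** attempt))
            if time.monotonic() - start + delay > deadline_s:
                raise TimeoutError("retry budget exhausted")
            time.sleep(delay)
```

The deadline matters more than the attempt count: it’s what keeps retries from amplifying a slow dependency into backpressure collapse upstream.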

Cost Optimization: it isn’t “spend less”

Cost Optimization is about spending on purpose. You want cost that buys reliability, speed, and security. You don’t want cost that buys comfort.

The anti-pattern I see the most is simple. Teams over-provision because they’re scared.

They don’t know the real load. They don’t trust autoscaling. They don’t trust their alerts. So they buy bigger instances and call it “stability.”

It works. Until the bill becomes a feature request.

The grey area: reliability eats cost for breakfast

This is where the Well-Architected pillars fight each other.

Reliability wants redundancy. Multi-AZ by default. Maybe multi-region. Maybe multi-cluster. Every one of those decisions adds cost. Not just AWS cost. Human cost.

If you’re early-stage, chasing 99.99% is a fast way to burn your roadmap.

You’ll pay for duplicate clusters and stacks. You’ll pay for cross-AZ and cross-region transfer. You’ll pay for extra load balancers, NAT gateways, and private connectivity. You’ll pay for replication, backups, and longer retention. You’ll pay for more IAM roles, more audit noise, and more on-call complexity.

And you still won’t get 99.99% if your change process is chaotic.

EKS is a quiet tax (and it’s usually not the nodes)

Most teams stare at EC2 instance cost. The leaks are usually elsewhere.

NAT Gateways are the classic one. They’re easy to forget. They show up as a surprise line item. Then you realize every private subnet route points to them.

Load balancers are another. One per service is fine for a demo. It gets expensive fast when every team ships their own.

Data transfer is the silent killer. Cross-AZ traffic isn’t free. Neither is pulling images, logs, and metrics across the wrong boundaries.
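The cross-AZ arithmetic is worth doing once, out loud. In most regions the charge applies in each direction, so a gigabyte crossing an AZ boundary costs roughly double the listed rate. A back-of-envelope sketch; the rate and traffic numbers are illustrative, so check current pricing for your region:

```python
# Rough arithmetic on cross-AZ chatter. The $0.01/GB rate is illustrative
# and is charged in EACH direction in most regions, so a GB crossing an
# AZ boundary costs about $0.02 total. Verify current pricing.

def cross_az_monthly_cost(gb_per_day: float,
                          rate_per_gb_each_way: float = 0.01) -> float:
    return gb_per_day * 30 * rate_per_gb_each_way * 2

# A chatty service pushing 500 GB/day across AZs:
print(f"${cross_az_monthly_cost(500):,.0f}/month")  # $300/month
```

That’s the quiet tax: nobody approved a $300/month line item, it just emerged from pod placement and a replicated cache.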

Then there’s “tooling sprawl.” Metrics, logs, tracing, agents. Each agent wants CPU and memory. Multiply by every node. Multiply by every cluster.

The boring wins that actually move the bill

Start with things that don’t require heroics.

Measure unit cost. Pick one.

Cost per request is great for APIs. Cost per active user is great for SaaS. Cost per GB ingested is great for pipelines. Track it weekly.
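The metric itself can be trivial; what matters is tracking the trend. A sketch of cost per million requests, assuming you pull weekly spend from Cost Explorer and request counts from your own metrics; all the numbers below are made up:

```python
# Sketch: one unit-cost metric, tracked weekly. Inputs would come from
# Cost Explorer and request metrics; the numbers here are made up.

def cost_per_million_requests(weekly_cost_usd: float,
                              weekly_requests: int) -> float:
    return weekly_cost_usd / (weekly_requests / 1_000_000)

weeks = [
    (4200.0, 310_000_000),  # (spend, requests) per week
    (4350.0, 298_000_000),
]
for cost, reqs in weeks:
    print(f"${cost_per_million_requests(cost, reqs):.2f} per 1M requests")
```

The absolute number is almost beside the point. A bill that grows slower than traffic is healthy; unit cost drifting up week over week is the early warning.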

Then go after obvious waste. Schedule non-prod down when you can. Right-size the “always on” databases before you touch anything else. Delete orphaned load balancers, unattached EBS volumes, and old snapshots. Stop paying for idle capacity “just in case.”

S3 Intelligent-Tiering is useful, but it’s not free

S3 Intelligent-Tiering can be a solid move for long-lived data with unclear access patterns. It can also be a waste if you throw everything into it without thinking.

It charges a monitoring and automation fee per object. If you have millions of tiny objects, the fee can matter. If your data is short-lived, lifecycle rules to cheaper storage classes can be cleaner.
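You can estimate where the fee stops paying for itself. A back-of-envelope sketch with illustrative Standard and Infrequent Access prices (check current pricing; AWS also excludes very small objects from monitoring entirely, which this ignores to keep the arithmetic visible):

```python
# Back-of-envelope: below what object size does the Intelligent-Tiering
# monitoring fee outweigh the best-case savings? Prices are illustrative,
# us-east-1-style numbers; check current pricing before deciding anything.
MONITORING_FEE_PER_OBJECT = 0.0025 / 1000   # $/object/month
STANDARD_PER_GB = 0.023                     # $/GB/month
INFREQUENT_PER_GB = 0.0125                  # $/GB/month (IA tier)

def break_even_object_bytes() -> float:
    """Object size where the fee equals the best-case tiering savings."""
    savings_per_gb = STANDARD_PER_GB - INFREQUENT_PER_GB
    return MONITORING_FEE_PER_OBJECT / savings_per_gb * 1024**3

print(f"{break_even_object_bytes() / 1024:.0f} KiB")  # roughly 250 KiB
```

At these example prices, objects much smaller than a few hundred KiB can’t save enough to cover their own monitoring. Millions of tiny objects flip the feature from savings to cost.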

My rule is simple. Don’t enable it because it sounds smart. Enable it because the access pattern is unknown and the data lives long enough to justify the overhead.

Graviton is cheaper. Migration is not.

Graviton instances are often a win on cost and performance. The hurdle is operational.

You need multi-arch container images. That means build pipelines with buildx and clean base images. You need to validate sidecars and agents. Some “standard” components lag behind on arm64 support.

The first month can feel like death by a thousand paper cuts. Then it gets easier.

If you’re running EKS at scale, it’s usually worth doing. If your team can’t keep base images patched, don’t add another moving part yet.

Terraform and ArgoCD: guardrails or footguns

Tools don’t save money. Defaults save money.

Terraform is great for consistency. It’s also great at copying expensive choices everywhere. One bad module becomes a company-wide bill.

ArgoCD is great for drift detection. It’s also a magnet for “everything is a Helm chart” culture. That can inflate complexity. Complexity becomes cost. Not just compute cost. People cost.

The trade-off is real. IaC and GitOps reduce chaos. They can also harden bad decisions into code that nobody wants to touch.

The cost of complexity when you chase 99.99%

When someone says “we need four nines,” I ask one question.

What are we willing to delete?

Because four nines isn’t an AWS setting. It’s an operating model.

You’ll build multi-region failover. You’ll rehearse it. You’ll maintain it. You’ll staff for it. You’ll keep your dependencies honest. You’ll accept slower feature delivery.

If you don’t do those things, you’re just paying for duplicate infrastructure.
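It helps to state what the nines actually buy, in minutes. The arithmetic is two lines:

```python
# What "nines" actually buy: the yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_budget_minutes(availability: float) -> float:
    return MINUTES_PER_YEAR * (1 - availability)

for a in (0.999, 0.9999):
    print(f"{a:.2%}: {downtime_budget_minutes(a):.1f} min/year")
```

Three nines gives you about 525 minutes a year, roughly one bad afternoon. Four nines gives you about 53 minutes a year, total, including every deploy, every migration, and every dependency hiccup. That’s why it’s an operating model, not a setting.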

Sustainability: efficiency is usually the only honest move

Most sustainability work looks like cost work with better intent.

Less idle compute. Fewer always-on environments. Shorter retention where it’s safe. Smaller data movement. Modern instance families when you’re ready.

If you’re trying to be “green” while keeping three idle staging clusters around all weekend, you’re lying to yourself.

This pillar gets treated like a poster. I treat it like an engineering constraint.

When you remove waste, you almost always reduce carbon impact. When you pick more efficient compute, you usually reduce cost. When you stop copying prod into three giant non-prod environments, you free up both money and attention.

The trade-off shows up when sustainability conflicts with developer convenience.

Turning non-prod off at night saves a lot. It also breaks workflows unless you have good bootstrapping and data seeding. That’s operational excellence work disguised as sustainability.
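The savings side of that trade is easy to quantify. A sketch, assuming an hourly-billed environment and a working-hours schedule; the 12-hour, 5-day schedule is just an example:

```python
# Arithmetic behind "schedule non-prod down": run working hours, weekdays
# only, and see what fraction of an hourly-billed bill disappears.
HOURS_PER_WEEK = 7 * 24  # 168

def off_hours_savings(on_hours_per_day: float = 12,
                      days_per_week: int = 5) -> float:
    """Fraction of the environment's cost removed by the schedule."""
    return 1 - (on_hours_per_day * days_per_week) / HOURS_PER_WEEK

print(f"{off_hours_savings():.0%}")  # ~64%
```

Roughly two-thirds of a non-prod bill gone, for free, if and only if the environment can come back up cleanly in the morning. Which is why the bootstrapping and data-seeding work has to come first.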

What I’d do Monday morning

Pick one workload. Run a Well-Architected review. Be ruthless about scope. Use the official tool so the risks are recorded.

Then do five tasks:

  1. Write down one SLO and wire alerts to user impact.
  2. Add deploy markers and make rollback boring.
  3. Remove one overly-broad IAM role or policy path.
  4. Pick one unit cost metric and put it on a dashboard.
  5. Pause one “four nines” project until you can answer: who will operate it?

References (official AWS docs)