Designing AWS Network Architecture: What a Cloud Architect Actually Decides
Network design on AWS is not a diagram exercise. It is how you bound blast radius, route trust, meet compliance, and keep workloads reachable at the cost and operability your organization can sustain. This post walks through the decisions architects make before anyone provisions a subnet.
In short
Start from traffic flows and non-functional requirements, choose account and VPC boundaries, layer subnets by role (not by habit), plan IP space for hybrid growth, connect with the smallest pattern that works (endpoints before NAT where possible), and bake observability and governance into the landing zone from day one.
Why networking is an architect problem
Application teams feel networks when deploys fail, latency spikes, or a security review blocks a launch. Architects own the contracts those teams inherit: which CIDR blocks exist, how environments are isolated, how on-premises systems reach the cloud, and what “private” means in practice.
A good AWS network design aligns with the AWS Well-Architected Framework—especially the Security, Reliability, and Operational Excellence pillars—without turning every workload into a bespoke snowflake. The goal is a repeatable landing zone with clear extension points for product teams.
1. Gather requirements before drawing boxes
Resist opening the VPC wizard first. Document:
- Traffic flows — user → internet → ALB → app → database; batch jobs; east-west between microservices; admin access; SaaS integrations.
- Residency and compliance — data must stay in-region; regulated subnets; logging retention; encryption in transit mandates.
- Availability targets — multi-AZ minimum for production; active-active across regions only when RTO/RPO justify the complexity.
- Connectivity — hybrid (VPN, Direct Connect), partner networks, multi-cloud, or cloud-only.
- Tenancy model — one account per environment, per team, or per application; affects IP planning and Transit Gateway design.
- Operational ownership — who approves CIDR changes, who runs network operations (NOC) for the Transit Gateway, who pays for NAT hours.
These inputs become your architecture decision records (ADRs). When someone asks for a /16 in every account “just in case,” you can point to the IP registry and hybrid overlap rules.
2. Know the AWS networking building blocks
Architects do not memorize every feature launch, but they should fluently map problems to primitives:
| Layer | AWS construct | Architect use |
|---|---|---|
| Global | Region, Availability Zone, Local Zone, Wavelength | Place workloads close to users and dependencies; design for AZ failure, not only instance failure. |
| Isolation | VPC, subnet, route table | Boundary for routing and addressing; a VPC spans the AZs of one Region, while each subnet sits in a single AZ. |
| Edge | Internet Gateway, NAT Gateway, egress-only IGW | Controlled inbound/outbound internet; NAT is a cost and availability choke point—size and AZ-span deliberately. |
| Private AWS access | VPC interface/gateway endpoints, PrivateLink | Keep S3, DynamoDB, and API traffic off the public internet and often reduce NAT spend. |
| Hybrid & multi-VPC | Site-to-Site VPN, Direct Connect, Transit Gateway, VPC peering | Connect on-premises and many VPCs; prefer TGW hub-spoke at scale over mesh peering. |
| Policy | Security groups, NACLs, Network Firewall, WAF, Shield | Stateful rules attached to network interfaces (security groups) vs stateless subnet-level NACLs; edge vs east-west inspection. |
| DNS | Route 53 public/private zones, Resolver endpoints | Service discovery, hybrid DNS forwarding, split-horizon patterns. |
3. Layered VPC pattern (the default worth knowing)
Most production VPCs use tiers by function, stretched across at least two AZs:
- Public subnets — load balancers, NAT gateways, bastion hosts (if you still use them). Route to an Internet Gateway for inbound/outbound internet paths you intend to expose.
- Private application subnets — EC2, ECS, EKS nodes, Lambda with VPC attachment. Default route to NAT for outbound internet when required; prefer endpoints for AWS APIs.
- Private data subnets — RDS, ElastiCache, OpenSearch. No direct internet route; security groups allow only from app tiers.
- Isolated subnets (optional) — no NAT, no IGW; internal-only or endpoint-only workloads.
        Internet
           │
           ▼
    ┌─────────────┐     ┌──────────────────┐     ┌─────────────────┐
    │ Public tier │────▶│ Private app tier │────▶│ Private data    │
    │ ALB, NAT    │     │ ECS/EKS/EC2      │     │ RDS, cache      │
    └─────────────┘     └──────────────────┘     └─────────────────┘
           │                     │                        │
           └─────────────────────┴────────────────────────┘
                 VPC endpoints (S3, STS, ECR, …)
                 optional: TGW / VPN / DX to on-prem
Subnet count grows with AZs: three AZs × three tiers = nine subnets before specialized slices (DMZ, analytics, etc.). That is fine if automation creates them; painful if hand-clicked.
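The tier-by-AZ arithmetic can be sketched with Python's standard `ipaddress` module. The VPC CIDR `10.20.0.0/16`, the tier and AZ names, and the /20 subnet size are illustrative assumptions, not values from a real plan.

```python
import ipaddress
from itertools import product


def plan_subnets(vpc_cidr, tiers, azs, new_prefix):
    """Assign one subnet per (tier, AZ) pair, carved in order from the VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)  # generator of equal-size blocks
    plan = {}
    for tier, az in product(tiers, azs):
        try:
            plan[(tier, az)] = next(blocks)
        except StopIteration:
            raise ValueError("VPC CIDR is too small for the requested layout") from None
    return plan


# Hypothetical inputs, for illustration only.
TIERS = ["public", "app", "data"]
AZS = ["az-a", "az-b", "az-c"]
plan = plan_subnets("10.20.0.0/16", TIERS, AZS, new_prefix=20)

for (tier, az), net in plan.items():
    print(f"{tier:<6} {az}  {net}")
```

With these inputs the layout uses nine of the sixteen available /20 blocks, so seven remain for later slices such as a DMZ or analytics tier.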
4. IP addressing and future-proofing
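CIDR planning is one of the few early decisions that is hard to reverse: a VPC's primary CIDR cannot be changed after creation, and renumbering live workloads is painful.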
- Reserve ranges per environment, region, and account in a central IP address management (IPAM) registry—AWS IPAM or your CMDB.
- Avoid overlapping RFC1918 space with on-premises, other clouds, and acquisitions; overlaps break VPN and TGW routing.
- Size for pod density: EKS can consume many IPs per node; Lambda in VPC scales ENIs; leave headroom in private subnets.
- Document whether you will use IPv6 (dual-stack ALB, egress-only IGW) for compliance or mobile clients.
Rule of thumb: a single-region application VPC often starts with a /16 split into /24 or /20 subnets per tier per AZ—but the right answer is the smallest plan that satisfies growth tables you have written down.
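A registry check for the overlap and headroom rules above might look like the following sketch. The registry entries are invented for illustration; the only AWS-specific fact used is that AWS reserves five IPv4 addresses in every subnet (the first four and the last).

```python
import ipaddress

# AWS reserves the first four addresses and the last address of each subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr):
    """IPv4 addresses actually assignable to workloads in an AWS subnet."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def find_overlaps(registry):
    """Return pairs of registry names whose CIDR blocks overlap."""
    items = [(name, ipaddress.ip_network(cidr)) for name, cidr in registry.items()]
    clashes = []
    for i, (name_a, net_a) in enumerate(items):
        for name_b, net_b in items[i + 1:]:
            if net_a.overlaps(net_b):
                clashes.append((name_a, name_b))
    return clashes


# Hypothetical registry: names and ranges are assumptions for the example.
registry = {
    "onprem-dc1": "10.0.0.0/12",
    "prod-us-east-1": "10.20.0.0/16",
    "dev-us-east-1": "10.8.0.0/16",
}

print(find_overlaps(registry))
print(usable_ips("10.20.0.0/24"), usable_ips("10.20.0.0/20"))
```

Here the hypothetical dev range falls inside the on-premises /12, which is exactly the kind of clash that breaks VPN and TGW routing later. Running a check like this in CI against the IP registry is one way to catch it before a VPC is created.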
5. Connectivity patterns and when to use them
- VPC endpoints — First choice for AWS service access (gateway endpoints for S3 and DynamoDB carry no additional charge; interface endpoints bill per hour and per GB processed). They shrink the data exfiltration surface and cut NAT costs.
- NAT Gateway — When private resources must reach arbitrary internet destinations. Deploy per AZ for resilience; accept the cost or use NAT instances only with eyes open on ops burden.
- VPC peering — Simple, non-transitive VPC-to-VPC links. Good for a few peers; becomes operational debt as a full mesh.
- Transit Gateway — Hub for many VPCs, VPN, and Direct Connect attachments; separate route tables act as routing domains, and appliance mode keeps flows symmetric through inspection VPCs.
- AWS PrivateLink — Consumer access to a service without VPC peering; common for SaaS and shared platform services.
- Site-to-Site VPN / Direct Connect — Hybrid: VPN for speed-to-value; DX for predictable bandwidth and lower latency to on-premises.
Architects choose the minimum coupling that meets the flow. If two systems only talk through an internal ALB, they may not need full VPC peering—PrivateLink or a shared services VPC might suffice.
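The "minimum coupling" rule can be written down as a small decision function. This is a sketch of the heuristics in this section, not an official AWS decision tree; the `Flow` fields and the `peering_limit` threshold of three are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One traffic flow from the requirements list (fields are illustrative)."""
    destination: str                     # "aws_service", "internet", "vpc", or "on_prem"
    peer_vpc_count: int = 0              # how many VPCs must interconnect
    single_service_only: bool = False    # only one service endpoint is consumed
    needs_predictable_bandwidth: bool = False


def suggest_pattern(flow, peering_limit=3):
    """Return the least-coupled connectivity pattern for a flow."""
    if flow.destination == "aws_service":
        return "VPC endpoint"
    if flow.destination == "internet":
        return "NAT Gateway per AZ"
    if flow.destination == "vpc":
        if flow.single_service_only:
            return "PrivateLink"
        if flow.peer_vpc_count <= peering_limit:
            return "VPC peering"
        return "Transit Gateway"
    if flow.destination == "on_prem":
        if flow.needs_predictable_bandwidth:
            return "Direct Connect"
        return "Site-to-Site VPN"
    raise ValueError(f"unknown destination: {flow.destination!r}")


print(suggest_pattern(Flow("vpc", peer_vpc_count=12)))
print(suggest_pattern(Flow("vpc", single_service_only=True)))
```

The value of writing it down is less the code than the forced conversation: the threshold at which peering stops being manageable is an organizational choice and belongs in an ADR.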
6. Security architecture on the wire
Defense in depth for networks means overlapping controls, not duplicate firewalls everywhere:
- Security groups — Default deny between tiers; reference other SGs, not CIDR sprawl. Document standard “web tier,” “app tier,” “data tier” SG templates.
- NACLs — Coarse subnet guards; useful for explicit deny lists, not day-to-day micro-segmentation.
- AWS Network Firewall or third-party NVAs — East-west or egress inspection when regulations require IDS/IPS on traffic between VPCs.
- WAF + Shield — At CloudFront or ALB for HTTP threats; network design must route user traffic through those edges intentionally.
- IAM and resource policies — Access to S3 and other service APIs is governed by more than the network path; a “private subnet” alone does not secure data.
- VPC Flow Logs, DNS query logs, Traffic Mirroring — Evidence for IR and capacity planning; decide retention and who can read logs (separate security account).
Pair network controls with identity and logging discipline—architects who only draw subnets leave half the attack surface unowned.
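The tier templates and the "reference other SGs" rule can be modelled as plain data and checked before anything is deployed. The group names (`sg:web`, `sg:app`, `sg:data`) and ports below are hypothetical placeholders, and the model ignores protocols and egress rules for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IngressRule:
    port: int
    source: str  # a security group name such as "sg:app", or a CIDR block


# Hypothetical tier templates; only the web tier admits traffic by CIDR.
TEMPLATES = {
    "sg:web": [IngressRule(443, "0.0.0.0/0")],
    "sg:app": [IngressRule(8080, "sg:web")],
    "sg:data": [IngressRule(5432, "sg:app")],
}


def can_reach(templates, src, dst, port):
    """True if dst has an ingress rule admitting src on the given port."""
    return any(r.port == port and r.source == src for r in templates.get(dst, []))


def cidr_sources(templates, exempt=("sg:web",)):
    """List (group, rule) pairs that admit by CIDR instead of by SG reference."""
    return [
        (group, rule)
        for group, rules in templates.items()
        if group not in exempt
        for rule in rules
        if not rule.source.startswith("sg:")
    ]


print(can_reach(TEMPLATES, "sg:app", "sg:data", 5432))
print(can_reach(TEMPLATES, "sg:web", "sg:data", 5432))
print(cidr_sources(TEMPLATES))
```

Because anything not listed is denied, the web tier has no path to the data tier in this model, and the lint function flags any internal tier that starts accumulating CIDR-based rules.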
7. Multi-account and landing zones
At organizational scale, the network is a platform product:
- AWS Organizations with Service Control Policies (SCPs) as guardrails, e.g. denying internet gateway creation in data accounts.
- Shared networking account hosts Transit Gateway, centralized egress, or firewall VPCs.
- AWS Resource Access Manager (RAM) shares subnets for “VPC factory” patterns where app accounts launch into centrally managed networks.
- Separate security / log archive accounts receive flow logs and firewall logs without application-team delete rights.
Hub-spoke via Transit Gateway is the common enterprise pattern: spokes are workload VPCs; the hub carries shared services, inspection, and hybrid attachments. Document route propagation—asymmetric routing causes subtle outages.
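One way to catch asymmetric routing at review time is to compare the forward and return path for every pair of attachments. The sketch below uses a deliberately simplified model (one next-hop label per destination) with invented attachment names; real Transit Gateway route tables are keyed by CIDR and attachment, so treat this as an illustration of the check rather than a tool.

```python
def asymmetric_pairs(routes):
    """Find pairs whose forward and return traffic take different paths.

    routes maps source -> {destination: next_hop_label}.
    Returns a list of (a, b, a_to_b_hop, b_to_a_hop) tuples.
    """
    mismatches = []
    names = sorted(routes)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            forward = routes[a].get(b)
            reverse = routes[b].get(a)
            if forward != reverse:
                mismatches.append((a, b, forward, reverse))
    return mismatches


# Hypothetical hub-spoke layout: spoke-data answers spoke-app directly,
# bypassing the inspection VPC that the forward path traverses.
routes = {
    "spoke-app": {"spoke-data": "inspection", "on-prem": "inspection"},
    "spoke-data": {"spoke-app": "direct", "on-prem": "inspection"},
    "on-prem": {"spoke-app": "inspection", "spoke-data": "inspection"},
}

print(asymmetric_pairs(routes))
```

A stateful firewall that sees only one direction of a flow will typically drop it, which is why a mismatch like this one shows up as an intermittent outage rather than a clean failure.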
8. Multi-region and disaster recovery
Networks follow DR strategy, not the reverse:
- Backup and restore — Second region VPC can be smaller until failover; DNS and AMIs must be ready.
- Pilot light / warm standby — Partial capacity in DR region; replicate data and security group templates as code.
- Active-active — Global load balancing, health checks, and often complex data-layer networking; justify with business RTO.
Route 53 health checks and failover routing are part of the design. Cross-region VPC peering or TGW inter-region peering adds cost and complexity—use when traffic volume warrants it.
9. Operability: what you inherit after launch
Architects who “throw designs over the wall” create 3 a.m. pages. Bake in:
- Infrastructure as code — VPC, subnets, routes, and endpoints in Terraform, CloudFormation, or CDK; peer review like application code.
- Tagging standards — `Environment`, `CostCenter`, `NetworkTier` for chargeback and automation.
- Runbooks — NAT exhaustion, TGW route limits, endpoint DNS failures, certificate expiry on PrivateLink services.
- Capacity alarms — Available IPs per subnet, NAT Gateway port allocation, TGW attachment throughput.
This is the same operational mindset as GitOps: declared state, reviewed changes, observable drift—applied to the network layer.
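Two of the items above, tagging standards and free-IP alarms, reduce to simple checks that can run in a pipeline or scheduled job. This sketch assumes the inventory has already been fetched into plain Python structures; the subnet IDs, counts, and the 20% threshold are hypothetical.

```python
REQUIRED_TAGS = {"Environment", "CostCenter", "NetworkTier"}


def missing_tags(resource_tags):
    """Return the required tag keys absent from a resource's tag dict."""
    return sorted(REQUIRED_TAGS - set(resource_tags))


def low_ip_subnets(subnets, threshold=0.2):
    """Return IDs of subnets whose free-IP ratio is below the threshold.

    subnets is an iterable of (subnet_id, available_ips, total_usable_ips).
    """
    return [
        subnet_id
        for subnet_id, available, total in subnets
        if total > 0 and available / total < threshold
    ]


# Hypothetical inventory snapshot.
print(missing_tags({"Environment": "prod"}))
print(low_ip_subnets([("subnet-a", 30, 251), ("subnet-b", 200, 251)]))
```

Feeding the results into whatever alerting the on-call team already uses keeps the network from being a special case in operations.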
10. Cost and sustainability levers architects control
Network line items surprise finance teams: NAT Gateway hours and data processing, cross-AZ traffic, TGW attachments, interface endpoints, and DX ports. Design choices matter:
- Endpoint-heavy designs vs NAT-heavy outbound internet.
- AZ-local traffic for tiered apps: keep calls between tiers in the same AZ where availability and consistency requirements allow, since cross-AZ transfer is billed.
- Consolidated egress through a shared services VPC vs NAT per spoke.
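The endpoint-versus-NAT trade-off is easy to estimate once the rates are written down. In the sketch below every rate is a round placeholder, not an AWS price; substitute the current figures for your Region. The comparison only applies to traffic bound for AWS services, since endpoints cannot replace NAT for arbitrary internet destinations.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation for monthly estimates


@dataclass(frozen=True)
class Rates:
    """Placeholder rates in USD; NOT real AWS prices."""
    nat_hourly: float
    nat_per_gb: float
    endpoint_hourly: float   # per interface endpoint, per AZ
    endpoint_per_gb: float


def nat_monthly(rates, azs, gb):
    """One NAT gateway per AZ, plus data processing charges."""
    return azs * rates.nat_hourly * HOURS_PER_MONTH + gb * rates.nat_per_gb


def endpoints_monthly(rates, azs, services, gb):
    """One interface endpoint per service per AZ, plus data processing."""
    return azs * services * rates.endpoint_hourly * HOURS_PER_MONTH + gb * rates.endpoint_per_gb


# Hypothetical workload: 3 AZs, 4 AWS services, 10,000 GB/month to those services.
rates = Rates(nat_hourly=0.05, nat_per_gb=0.05, endpoint_hourly=0.01, endpoint_per_gb=0.01)
print(round(nat_monthly(rates, azs=3, gb=10_000), 2))
print(round(endpoints_monthly(rates, azs=3, services=4, gb=10_000), 2))
```

The shape of the result matters more than the numbers: endpoint cost grows with the count of services and AZs, while NAT cost grows with data volume, so the crossover point depends on the workload.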
See also designing for cost and sustainability when architecture review board (ARB) reviews include FinOps and GreenOps criteria.
Common pitfalls
- Flat VPC — Everything in one subnet “because it is simpler”; blast radius and compliance reviews suffer.
- Over-peering — Full mesh between dozens of VPCs; routing tables become undebuggable.
- Single NAT in one AZ — AZ outage takes all private outbound traffic with it.
- Hard-coded CIDRs in security groups — Breaks when IP plans change; prefer SG references.
- Console-only drift — Manual route table edits not reflected in IaC; GitOps for networks fixes this.
- Ignoring DNS — Hybrid failures are often Resolver forwarding, not “the VPC is down.”
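Some of these pitfalls can be linted mechanically. As one example, the single-NAT problem shows up as private subnets whose default route points at a NAT gateway in a different AZ. The sketch below assumes that mapping has already been extracted from route tables; the subnet and AZ names are invented.

```python
def cross_az_nat_dependencies(private_subnets):
    """Return IDs of subnets whose default route targets a NAT gateway in another AZ.

    Each subnet is a dict with 'id', 'az', and 'nat_az' (None when the
    subnet has no NAT route, as in an isolated or data tier).
    """
    return [
        subnet["id"]
        for subnet in private_subnets
        if subnet["nat_az"] is not None and subnet["nat_az"] != subnet["az"]
    ]


# Hypothetical layout with a single NAT gateway in az-a.
subnets = [
    {"id": "app-a", "az": "az-a", "nat_az": "az-a"},
    {"id": "app-b", "az": "az-b", "nat_az": "az-a"},
    {"id": "data-a", "az": "az-a", "nat_az": None},
]

print(cross_az_nat_dependencies(subnets))
```

Any subnet this returns loses outbound internet access if the NAT gateway's AZ fails, and pays cross-AZ transfer on that traffic in the meantime.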
Architect review checklist
Use these questions in design reviews before build:
- What are the north-south and east-west flows, and who owns each hop?
- Which accounts and VPCs bound blast radius for this workload?
- Is the IP plan registered and non-overlapping with hybrid and future regions?
- How do private workloads reach AWS APIs and the internet—endpoints, NAT, or both?
- How does traffic reach on-premises and partner networks—and what fails if one AZ drops?
- Where are logs stored, who can delete them, and what is the retention?
- What is automated, tagged, and documented for the on-call engineer?
How this connects to the rest of cloud architecting
Networking is the skeleton; compute, data, and identity are the organs. My notes on Cloud Architecting and cloud platform evolution cover the wider landing-zone story—network design is where multi-AZ reliability and segmentation become concrete.
Start small: one well-structured VPC with endpoints, layered subnets, flow logs, and IaC. Grow into Organizations, Transit Gateway, and inspection VPCs when traffic and governance demand it—not because a reference diagram looked impressive on slide one.
Further reading
- AWS — VPC User Guide, Transit Gateway, and AWS Well-Architected Framework (Security & Reliability pillars)
- AWS — Building a Scalable and Secure Multi-VPC AWS Network Infrastructure (whitepaper-style guidance)
- AWS — AWS Security Reference Architecture for opinionated account and network layouts
- HashiCorp / community patterns — Terraform modules for VPC and TGW (study for structure, not blind copy-paste)