AWS Networking Best Practices: VPC, Transit Gateway, and Beyond
Master AWS networking with this comprehensive guide. Learn VPC design, security groups, Transit Gateway, Direct Connect, and cost optimization strategies with production-ready examples.
AWS Networking: Production-Ready Guide
This is Part 1 of our Cloud Networking series. If you haven’t read the overview, start with Cloud Networking Done Right: Series Overview.
Other parts in this series
- Part 1a: AWS Service Networking Guide →
- Part 2: Azure Networking Best Practices →
- Part 3: GCP Networking Best Practices →
Quick Start: Deploy Your First AWS VPC
Want to get a production-ready VPC running quickly? This Terraform configuration gives you a solid foundation with all the essentials: multi-AZ deployment for high availability, properly segmented subnets for different workload tiers, NAT Gateways for secure outbound connectivity, and free VPC endpoints to reduce data transfer costs. You can customize the CIDR ranges and AZ selections based on your region and IP planning requirements.
Production VPC Module Configuration terraform
# Save as main.tf and run: terraform init && terraform apply
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "my-production-vpc"
  cidr = "10.0.0.0/16" # 65,536 IP addresses

  # Deploy across 3 AZs for high availability
  azs = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Private subnets: For application servers, no direct internet access
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

  # Public subnets: For load balancers, NAT gateways
  public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  # Database subnets: Isolated tier for RDS
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  # NAT Gateway configuration - High availability
  enable_nat_gateway     = true
  single_nat_gateway     = false # Set true for dev/test to save ~$64/month
  one_nat_gateway_per_az = true

  # Enable DNS
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# FREE VPC Endpoints - Save on NAT Gateway costs
# Versions 3.0+ of the VPC module no longer accept enable_s3_endpoint or
# enable_dynamodb_endpoint, so the gateway endpoints are declared directly.
resource "aws_vpc_endpoint" "s3_gateway" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.us-east-1.s3" # Saves $0.045/GB for S3 traffic
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
}

resource "aws_vpc_endpoint" "dynamodb_gateway" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.us-east-1.dynamodb" # Saves $0.045/GB for DynamoDB traffic
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
}

# Output for use in other modules
output "vpc_id" {
  value = module.vpc.vpc_id
}

output "private_subnet_ids" {
  value = module.vpc.private_subnets
}
Estimated Monthly Cost: $96-150 (3 NAT Gateways + data transfer)
AWS VPC Architecture
Before diving into individual components, let’s understand how they work together in a typical production VPC. This architecture shows a highly available, multi-tier application deployed across two Availability Zones.
What you’re seeing:
- Public subnets host internet-facing resources (load balancers, NAT Gateways)
- Private subnets host application servers with no direct internet access
- Database subnets provide an additional isolation layer for sensitive data
- Multiple AZs ensure high availability - if one AZ fails, the other continues serving traffic
Understanding AWS VPC Components
1. VPC (Virtual Private Cloud)
What it is: Your isolated network in AWS where you launch resources.
Key Characteristics:
- Regional resource (doesn’t span regions)
- Requires CIDR block (e.g., 10.0.0.0/16)
- Default limit of 5 IPv4 CIDR blocks per VPC (can request increase)
- Supports IPv4 and IPv6 (dual-stack)
VPC Best Practices
- Use /16 for production VPCs - Provides 65,536 IP addresses for growth
- Private IP ranges only - Use 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16
- Plan for growth - Running out of IPs requires complex migration
- Document IP allocation - Maintain clear records to avoid overlaps
2. Subnets
What they are: Segments of your VPC CIDR block, confined to a single Availability Zone.
Types:
- Public Subnet: Has route to Internet Gateway, resources can have public IPs
- Private Subnet: No direct internet access, uses NAT Gateway for outbound
- Database Subnet: Isolated subnet for databases, no internet access
Subnet Best Practices
- Multi-AZ deployment - Create subnets across multiple Availability Zones for high availability
- Use /24 for most subnets - Provides 256 IPs (AWS reserves 5, leaving 251 usable)
- Clear naming convention - Use descriptive names like prod-public-us-east-1a, prod-private-us-east-1a
- Reserve ranges - Keep subnet ranges available for future expansion
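The subnet plan above can also be derived rather than typed by hand. The following is a minimal sketch, assuming the same 10.0.0.0/16 VPC range and three AZs used in the Quick Start; the local names are illustrative and not part of any module.

```terraform
# Sketch: derive /24 subnets from the /16 VPC range with cidrsubnet().
locals {
  vpc_cidr = "10.0.0.0/16"
  az_count = 3

  # cidrsubnet(prefix, newbits, netnum): /16 plus 8 new bits gives a /24
  private_subnets  = [for i in range(local.az_count) : cidrsubnet(local.vpc_cidr, 8, i + 1)]
  public_subnets   = [for i in range(local.az_count) : cidrsubnet(local.vpc_cidr, 8, i + 101)]
  database_subnets = [for i in range(local.az_count) : cidrsubnet(local.vpc_cidr, 8, i + 201)]

  # AWS reserves 5 addresses in every subnet
  usable_ips_per_subnet = pow(2, 32 - 24) - 5
}

output "private_subnets" {
  value = local.private_subnets # 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24
}

output "usable_ips_per_subnet" {
  value = local.usable_ips_per_subnet # 251
}
```

Deriving the ranges from one base CIDR keeps the tiers from overlapping and leaves the gaps between 10.0.3.0/24 and 10.0.101.0/24 free for later expansion.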
3. Internet Gateway (IGW)
The Internet Gateway is your VPC’s connection to the public internet. It’s a simple concept but critical to understand: without an IGW, your VPC is completely isolated from the internet.
How it works:
- Attached to your VPC (one IGW per VPC)
- Performs NAT for instances with public IP addresses
- Horizontally scaled, redundant, and highly available by AWS
- No bandwidth constraints or throughput limits
Key Points:
- Completely free - no hourly charges or data processing fees
- Only works for resources with public IP addresses
- Requires a route in your route table pointing 0.0.0.0/0 to the IGW
When to use:
- Public-facing resources like Application Load Balancers
- Bastion hosts that need direct internet access
- Any resource that needs to be reachable from the internet
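As a rough illustration of the pieces described above, the sketch below wires an Internet Gateway to one public subnet using plain AWS provider resources. The VPC, subnet, and resource names are hypothetical and exist only to keep the example self-contained; the VPC module from the Quick Start creates the equivalent resources for you.

```terraform
# Hypothetical standalone VPC, used only to keep this sketch self-contained
resource "aws_vpc" "example" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "public" {
  vpc_id                  = aws_vpc.example.id
  cidr_block              = "10.0.101.0/24"
  availability_zone       = "us-east-1a"
  map_public_ip_on_launch = true # instances need a public IP to use the IGW
}

# One Internet Gateway per VPC
resource "aws_internet_gateway" "example" {
  vpc_id = aws_vpc.example.id
}

# Explicit route table for the public subnet
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.example.id
}

# Default route: everything outside the VPC goes to the IGW
resource "aws_route" "public_default" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.example.id
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}
```

What makes a subnet "public" is only the 0.0.0.0/0 route to the gateway; a subnet without that route stays private even if an IGW is attached to the VPC.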
4. NAT Gateway
NAT Gateway solves a common problem: your private instances need to download updates, call external APIs, or access AWS services, but you don’t want to give them public IPs. NAT Gateway provides outbound internet connectivity while keeping instances completely private.
How it works:
- Deployed in a public subnet (needs internet access itself)
- Private instances route their outbound traffic through the NAT Gateway
- NAT Gateway translates private IPs to its public IP
- Return traffic is automatically routed back to the originating instance
- Important: Only works for outbound traffic - inbound connections are blocked
Cost: $0.045/hour ($32/month) + $0.045/GB data processed
Performance: Scales automatically up to 100 Gbps bandwidth per NAT Gateway
Best Practices:
- High availability: Deploy one NAT Gateway per AZ (prevents single point of failure)
- Cost optimization: Use single NAT Gateway in dev/test environments
- Avoid NAT charges: Use VPC Endpoints for AWS services (S3, DynamoDB, etc.)
- Monitor costs: Data processing fees can add up—review VPC Flow Logs to identify high-traffic sources
Cost Optimization:
Choosing the right NAT Gateway configuration can significantly impact your monthly AWS bill. For development and test environments where high availability isn’t critical, a single NAT Gateway saves around $64/month. Production environments should use one NAT Gateway per Availability Zone to ensure that if one AZ experiences issues, workloads in other AZs can still reach the internet independently.
NAT Gateway Cost Optimization terraform
# Development/Test: Single NAT Gateway
module "vpc_dev" {
  source = "terraform-aws-modules/vpc/aws"

  enable_nat_gateway = true
  single_nat_gateway = true # Saves $64/month (2 NAT Gateways)
}

# Production: One NAT Gateway per AZ
module "vpc_prod" {
  source = "terraform-aws-modules/vpc/aws"

  enable_nat_gateway     = true
  one_nat_gateway_per_az = true # High availability
}
5. Route Tables
Route tables are like GPS for your VPC traffic—they determine where network packets go. Every subnet must be associated with a route table, and that route table’s rules determine how traffic flows.
How they work:
- Each route has a destination (CIDR block) and a target (where to send traffic)
- Routes are evaluated from most specific to least specific
- Local routes (within VPC) are automatically added and can’t be deleted
- Each subnet can only be associated with one route table
Key Concepts:
- Main route table: Created automatically with your VPC, used by default for any subnet without explicit association
- Custom route tables: Create these for specific routing needs (public vs private subnets)
- Best practice: Don’t use the main route table for production subnets—create explicit route tables
Example Route Table (Private Subnet):
Destination    Target
10.0.0.0/16    local (VPC)
0.0.0.0/0      nat-xxxxx (NAT Gateway)
Example Route Table (Public Subnet):
Destination    Target
10.0.0.0/16    local (VPC)
0.0.0.0/0      igw-xxxxx (Internet Gateway)
6. Elastic IPs
Elastic IPs are static, public IPv4 addresses that you can allocate to your AWS account and associate with resources. They’re essential when you need a predictable public IP address.
Common use cases:
- NAT Gateways: Each NAT Gateway requires an Elastic IP
- Whitelisting: Third-party APIs that require IP whitelisting
- DNS records: When you need a static IP for A records
- Failover: Quickly remap to a standby instance during failures
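Cost: Since February 1, 2024, AWS charges $0.005/hour (~$3.60/month) for every public IPv4 address, including Elastic IPs, whether or not the address is associated with a running resource. An idle Elastic IP costs the same as one in use, so release any you no longer need.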
The following Terraform configuration demonstrates how to allocate Elastic IPs for NAT Gateways across multiple Availability Zones. This pattern is common in production environments where you need predictable outbound IP addresses for whitelisting with third-party services or compliance requirements.
Elastic IP Allocation for NAT Gateways terraform
# Allocate Elastic IPs for NAT Gateways
resource "aws_eip" "nat" {
  count  = 3 # One per AZ
  domain = "vpc"

  tags = {
    Name = "nat-eip-${count.index + 1}"
  }
}

# Associate with NAT Gateway
resource "aws_nat_gateway" "main" {
  count         = 3
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = module.vpc.public_subnets[count.index]

  tags = {
    Name = "nat-gw-${count.index + 1}"
  }
}

# Output for whitelisting with third-party services
output "nat_public_ips" {
  description = "Public IPs for outbound traffic (whitelist these)"
  value       = aws_eip.nat[*].public_ip
}
Elastic IP Best Practices
- Release unused EIPs: Audit regularly to avoid charges for unassociated IPs
- Document allocations: Track which EIPs are whitelisted with external services
- Use tags: Tag EIPs with their purpose for easy identification
- Limit of 5 per region: Default quota, can request increase if needed
AWS Security: Defense in Depth
AWS networking security follows a “defense in depth” strategy—multiple layers of security controls that work together. If one layer is breached, others provide backup protection. Think of it like a castle with multiple walls, moats, and gates.
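The security layers covered here are Security Groups (instance level), Network ACLs (subnet level), and VPC Flow Logs (visibility).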
Security Groups (Most Important!)
Security Groups are your most important security control in AWS. They act as virtual firewalls for your EC2 instances, controlling both inbound and outbound traffic. Understanding security groups is essential for any AWS deployment.
What makes them special:
- Stateful: If you allow inbound traffic on port 443, the response traffic is automatically allowed back out—you don’t need a separate outbound rule
- Deny by default: All inbound traffic is blocked unless you explicitly allow it. All outbound traffic is allowed by default
- Instance-level: Applied to ENIs (Elastic Network Interfaces), not subnets
- Dynamic: Changes take effect immediately—no need to restart instances
- Security group references: You can allow traffic from another security group instead of IP ranges (powerful for microservices)
Limits to know:
- 60 inbound + 60 outbound rules per security group (default, can increase)
- 5 security groups per ENI (default, can increase to 16)
- 2,500 security groups per Region (default)
Common mistake: New users often confuse security groups with NACLs. Security groups are stateful and deny by default; NACLs are stateless and require explicit allow/deny rules.
Production Example:
This comprehensive example demonstrates a three-tier security group architecture commonly used in production environments. The web tier accepts traffic from the internet, the application tier only accepts traffic from the web tier, and the database tier only accepts traffic from the application tier. This creates a defense-in-depth approach where each layer can only communicate with its adjacent layers, significantly reducing the attack surface.
Three-Tier Security Group Architecture terraform
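# The tiers reference each other, so every rule is a separate
# aws_security_group_rule resource. Inline ingress/egress blocks that point
# at each other would create a dependency cycle between the groups.

# Web tier security group
resource "aws_security_group" "web_tier" {
  name        = "web-tier-sg"
  description = "Security group for web tier"
  vpc_id      = module.vpc.vpc_id

  tags = {
    Name = "web-tier-sg"
    Tier = "web"
  }
}

# Allow HTTPS from anywhere
resource "aws_security_group_rule" "web_https_in" {
  type              = "ingress"
  description       = "HTTPS from internet"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.web_tier.id
}

# Allow HTTP (redirect to HTTPS at ALB)
resource "aws_security_group_rule" "web_http_in" {
  type              = "ingress"
  description       = "HTTP from internet"
  from_port         = 80
  to_port           = 80
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.web_tier.id
}

# Outbound to app tier only
resource "aws_security_group_rule" "web_to_app" {
  type                     = "egress"
  description              = "To app tier"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.app_tier.id
  security_group_id        = aws_security_group.web_tier.id
}

# Application tier security group
resource "aws_security_group" "app_tier" {
  name        = "app-tier-sg"
  description = "Security group for application tier"
  vpc_id      = module.vpc.vpc_id

  tags = {
    Name = "app-tier-sg"
    Tier = "application"
  }
}

# Only allow traffic from web tier
resource "aws_security_group_rule" "app_from_web" {
  type                     = "ingress"
  description              = "From web tier"
  from_port                = 8080
  to_port                  = 8080
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.web_tier.id
  security_group_id        = aws_security_group.app_tier.id
}

# Outbound to database tier only
resource "aws_security_group_rule" "app_to_db" {
  type                     = "egress"
  description              = "To database tier"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.db_tier.id
  security_group_id        = aws_security_group.app_tier.id
}

# Outbound HTTPS for API calls
resource "aws_security_group_rule" "app_https_out" {
  type              = "egress"
  description       = "HTTPS for external APIs"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.app_tier.id
}

# Database tier security group
resource "aws_security_group" "db_tier" {
  name        = "db-tier-sg"
  description = "Security group for database tier"
  vpc_id      = module.vpc.vpc_id

  # No outbound internet access
  # (Add specific egress rules if needed)

  tags = {
    Name = "db-tier-sg"
    Tier = "database"
  }
}

# Only allow traffic from app tier
resource "aws_security_group_rule" "db_from_app" {
  type                     = "ingress"
  description              = "PostgreSQL from app tier"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  source_security_group_id = aws_security_group.app_tier.id
  security_group_id        = aws_security_group.db_tier.id
}
Network ACLs (NACLs)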
Network ACLs are often misunderstood and overused. They’re stateless firewalls that operate at the subnet level, providing an additional layer of security beyond Security Groups. However, for most use cases, Security Groups alone are sufficient.
Key difference from Security Groups:
- Stateless: Unlike Security Groups, NACLs don’t track connection state. If you allow inbound traffic on port 443, you must also explicitly allow the return traffic on ephemeral ports (1024-65535)
- Subnet-level: Applied to entire subnets, not individual instances
- Rule evaluation: Rules are numbered and evaluated in order (lowest number first)
- Allow and Deny: Can explicitly deny traffic (Security Groups can only allow)
When to use NACLs:
- Block specific IPs: Deny traffic from known malicious IP ranges
- Compliance requirements: Some regulations require subnet-level controls
- Additional defense layer: Defense in depth strategy
- Temporary blocks: Quick way to block traffic to an entire subnet
Best Practice: Start with the default NACL (allows all traffic) and only add custom NACLs when you have a specific security requirement. Most applications don’t need them.
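If you do need a custom NACL, the stateless behavior and rule ordering described above look roughly like this. This is a partial sketch, not a complete rule set: the two input variables and the blocked range (a documentation-only CIDR) are placeholders, and a real subnet would also need rules for its own outbound connections and their return traffic.

```terraform
# Hypothetical inputs: supply your own VPC and subnet IDs
variable "vpc_id" {
  type = string
}

variable "private_subnet_ids" {
  type = list(string)
}

resource "aws_network_acl" "private" {
  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids

  # Lowest rule number wins: the deny at 90 is evaluated before the allow at 100
  ingress {
    rule_no    = 90
    action     = "deny"
    protocol   = "-1"
    cidr_block = "198.51.100.0/24" # placeholder for a known-bad range
    from_port  = 0
    to_port    = 0
  }

  # Allow HTTPS from inside the VPC
  ingress {
    rule_no    = 100
    action     = "allow"
    protocol   = "tcp"
    cidr_block = "10.0.0.0/16"
    from_port  = 443
    to_port    = 443
  }

  # Stateless: responses to clients' ephemeral ports need their own egress rule
  egress {
    rule_no    = 100
    action     = "allow"
    protocol   = "tcp"
    cidr_block = "10.0.0.0/16"
    from_port  = 1024
    to_port    = 65535
  }

  tags = {
    Name = "private-nacl"
  }
}
```

Forgetting the ephemeral-port egress rule is the most common way a custom NACL silently breaks traffic that the security groups would otherwise allow.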
VPC Flow Logs
VPC Flow Logs are your network traffic recorder—they capture information about IP traffic going to and from network interfaces in your VPC. Think of them as your VPC’s black box, essential for troubleshooting, security analysis, and cost optimization.
What they capture:
- Source and destination IP addresses
- Source and destination ports
- Protocol (TCP, UDP, ICMP)
- Number of packets and bytes
- Action taken (ACCEPT or REJECT)
- Timestamps
Why you need them:
- Troubleshooting: Diagnose why connections are failing (security group rules, routing issues)
- Security analysis: Detect unusual traffic patterns, potential attacks, or data exfiltration
- Cost optimization: Identify which resources are generating high data transfer costs
- Compliance: Many regulations require network traffic logging
- Forensics: Investigate security incidents after they occur
Cost consideration: Flow Logs are charged based on the amount of data ingested (~$0.50 per GB for CloudWatch Logs). For high-traffic VPCs, this can add up. Consider capturing only REJECT traffic or scoping flow logs to specific subnets.
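One way to keep ingestion volume down is to log a single subnet and only rejected traffic. The sketch below assumes the `module.vpc` from the Quick Start; the bucket name is a placeholder and the resource names are illustrative.

```terraform
# Hypothetical bucket; S3 bucket names must be globally unique
resource "aws_s3_bucket" "flow_logs_rejects" {
  bucket = "example-flow-logs-rejects"
}

# Flow log scoped to one subnet, capturing only rejected traffic
resource "aws_flow_log" "private_subnet_rejects" {
  subnet_id            = module.vpc.private_subnets[0]
  traffic_type         = "REJECT"
  log_destination_type = "s3"
  log_destination      = aws_s3_bucket.flow_logs_rejects.arn

  # Aggregate over 10 minutes (600 s) rather than 1 minute (60 s)
  max_aggregation_interval = 600

  tags = {
    Name = "private-subnet-rejects"
  }
}
```

REJECT-only logs are useful for spotting blocked connections and scans, but they will not show accepted traffic, so they are not a substitute for full logging where compliance requires it.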
Implementation:
VPC Flow Logs can be sent to either CloudWatch Logs for real-time analysis and alerting, or to S3 for cost-effective long-term storage and batch analysis with Athena. The CloudWatch option is ideal for operational monitoring and quick troubleshooting, while S3 with Parquet format is better for compliance requirements and historical analysis. Most production environments use both destinations for different retention periods.
VPC Flow Logs Configuration terraform
# VPC Flow Logs to CloudWatch
resource "aws_flow_log" "vpc_flow_logs" {
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  log_destination_type = "cloud-watch-logs"
  traffic_type         = "ALL" # or "ACCEPT" or "REJECT"
  vpc_id               = module.vpc.vpc_id
  iam_role_arn         = aws_iam_role.flow_logs.arn

  tags = {
    Name = "vpc-flow-logs"
  }
}

resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/aws/vpc-flow-log/main-vpc"
  retention_in_days = 90 # Adjust based on compliance needs

  tags = {
    Name = "vpc-flow-logs"
  }
}

# VPC Flow Logs to S3 (cheaper for long-term storage)
resource "aws_flow_log" "vpc_flow_logs_s3" {
  log_destination      = aws_s3_bucket.flow_logs.arn
  log_destination_type = "s3"
  traffic_type         = "ALL"
  vpc_id               = module.vpc.vpc_id

  # Parquet format for Athena queries
  destination_options {
    file_format        = "parquet"
    per_hour_partition = true
  }

  tags = {
    Name = "vpc-flow-logs-s3"
  }
}
VPC Endpoints: Save Money on NAT Gateway
Here’s a common problem: Your EC2 instances in private subnets need to access AWS services like S3 or DynamoDB. Without VPC Endpoints, this traffic goes through your NAT Gateway, costing you $0.045/GB in data processing fees. For high-traffic applications, this can mean hundreds of dollars per month in unnecessary costs.
The solution: VPC Endpoints provide private connectivity to AWS services without going through the NAT Gateway or internet. Traffic stays on AWS’s private network, improving security and reducing costs.
Two types of VPC Endpoints:
- Gateway Endpoints (FREE): For S3 and DynamoDB only
- Interface Endpoints (~$7/month per AZ + data): For most other AWS services
Cost savings example: If your application transfers 1TB/month to S3 through NAT Gateway:
- Without VPC Endpoint: 1000 GB × $0.045 = $45/month
- With Gateway Endpoint: $0/month
That’s $540/year saved per TB of S3 traffic!
Gateway Endpoints (FREE!)
Gateway Endpoints for S3 and DynamoDB are completely free and should be deployed in every VPC. They work by adding routes to your route tables that direct traffic destined for these services through AWS’s private network instead of the NAT Gateway. This not only saves money but also improves security by keeping traffic off the public internet and reduces latency for high-throughput workloads.
S3 and DynamoDB Gateway Endpoints terraform
# S3 Gateway Endpoint - FREE, no hourly charges
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"

  # Associate with private subnet route tables
  route_table_ids = module.vpc.private_route_table_ids

  tags = {
    Name = "s3-gateway-endpoint"
  }
}

# DynamoDB Gateway Endpoint - FREE
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.region}.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids

  tags = {
    Name = "dynamodb-gateway-endpoint"
  }
}
Savings: $0.045/GB that would go through NAT Gateway
Interface Endpoints (Paid)
Interface Endpoints use AWS PrivateLink to create elastic network interfaces (ENIs) in your subnets that serve as entry points for traffic destined to supported AWS services. Unlike Gateway Endpoints, they support a wide range of services including ECR, CloudWatch, Secrets Manager, and many more. The trade-off is cost: each endpoint runs about $7/month per Availability Zone plus data processing fees. For services with high traffic volumes, the savings over NAT Gateway data processing fees can still be substantial.
Cost: $0.01/hour per endpoint per AZ (~$7/month) + $0.01/GB data processed
ECR Interface Endpoints terraform
# ECR API Endpoint
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets # one ENI, and one hourly charge, per subnet/AZ
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "ecr-api-endpoint"
  }
}

# ECR Docker Endpoint
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "ecr-dkr-endpoint"
  }
}

# Security group for VPC endpoints
resource "aws_security_group" "vpc_endpoints" {
  name        = "vpc-endpoints-sg"
  description = "Security group for VPC endpoints"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "HTTPS from VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }

  tags = {
    Name = "vpc-endpoints-sg"
  }
}
When to use Interface Endpoints:
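- High traffic to AWS services (roughly 200+ GB/month per endpoint and AZ)
- Break-even point: the ~$7/month hourly charge per AZ is recovered once the $0.035/GB saved on data processing ($0.045 via NAT Gateway vs $0.01 via the endpoint) adds up to it, at roughly 210 GB/month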
- Services: ECR, CloudWatch Logs, Systems Manager, Secrets Manager
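The break-even point can be worked out from the prices quoted in this guide. The sketch below uses only Terraform locals, so it can be evaluated with `terraform console` or `terraform apply` without creating anything; the local names are illustrative, and the prices are the us-east-1 figures used in this article, so check current pricing before relying on the result.

```terraform
locals {
  nat_processing_per_gb      = 0.045 # NAT Gateway data processing, USD/GB
  endpoint_processing_per_gb = 0.01  # Interface Endpoint data processing, USD/GB
  endpoint_hourly            = 0.01  # USD/hour per endpoint per AZ
  hours_per_month            = 730

  # Fixed monthly charge for one endpoint in one AZ
  endpoint_fixed_monthly = local.endpoint_hourly * local.hours_per_month

  # Each GB moved from the NAT Gateway to the endpoint saves the difference
  savings_per_gb = local.nat_processing_per_gb - local.endpoint_processing_per_gb

  # Monthly traffic at which the savings cover the fixed charge
  break_even_gb = local.endpoint_fixed_monthly / local.savings_per_gb
}

output "interface_endpoint_break_even_gb" {
  value = format("%.0f", local.break_even_gb)
}
```

With these inputs the result is roughly 209 GB/month for a single-AZ endpoint. An endpoint deployed in three AZs has three times the fixed charge, so its break-even traffic is three times higher.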
Transit Gateway: Hub-and-Spoke Architecture
As your AWS environment grows, connecting multiple VPCs becomes complex. VPC Peering works for 2-3 VPCs, but a full mesh needs N×(N-1)/2 peering connections: 10 for 5 VPCs and 45 for 10. Transit Gateway solves this by acting as a central hub. Each VPC connects once to the Transit Gateway, and the Transit Gateway handles routing between all VPCs.
Why use Transit Gateway:
- Simplified connectivity: Connect 10 VPCs with 10 attachments instead of 45 peering connections
- Centralized routing: Manage all inter-VPC routing in one place
- On-premises integration: Connect your data center once, reach all VPCs
- Scalability: Supports thousands of VPCs per Transit Gateway
- Segmentation: Use route tables to control which VPCs can talk to each other
How it works:
- Create a Transit Gateway in your region
- Attach VPCs to the Transit Gateway (one attachment per VPC)
- Configure route tables on the Transit Gateway to control traffic flow
- Update VPC route tables to send inter-VPC traffic to the Transit Gateway
Cost: $0.05/hour per attachment (~$36/month for each attachment, such as a VPC) + $0.02/GB data processed
Break-even point: Transit Gateway usually becomes worthwhile when you have 3+ VPCs that need to communicate, because it replaces a growing mesh of peering connections. Below that, VPC Peering (no hourly charge) is usually cheaper.
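To see how quickly a peering mesh grows compared with hub-and-spoke attachments, the counts can be computed directly. This is a small sketch using a hypothetical `vpc_count` variable; it creates no resources.

```terraform
# Hypothetical input: number of VPCs that must all reach each other
variable "vpc_count" {
  type    = number
  default = 10
}

locals {
  # Full mesh: every pair of VPCs needs its own peering connection
  peering_connections = var.vpc_count * (var.vpc_count - 1) / 2

  # Hub-and-spoke: one Transit Gateway attachment per VPC
  tgw_attachments = var.vpc_count
}

output "peering_connections" {
  value = local.peering_connections # 45 for 10 VPCs
}

output "tgw_attachments" {
  value = local.tgw_attachments # 10 for 10 VPCs
}
```

Running it with `-var vpc_count=5` gives 10 peering connections against 5 attachments, which is why the operational argument for Transit Gateway gets stronger with every VPC added.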
Implementation:
Setting up Transit Gateway involves creating the gateway itself, attaching your VPCs, and updating route tables in each VPC to direct inter-VPC traffic through the Transit Gateway. The configuration below shows a typical setup with production and development VPCs. Note that you attach subnets (not the VPC directly) to the Transit Gateway—typically your private subnets since that’s where most inter-VPC traffic originates.
Transit Gateway Hub-and-Spoke Setup terraform
# Transit Gateway
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Main Transit Gateway"
  default_route_table_association = "enable"
  default_route_table_propagation = "enable"
  dns_support                     = "enable"
  vpn_ecmp_support                = "enable"

  tags = {
    Name = "main-tgw"
  }
}

# Attach Production VPC
resource "aws_ec2_transit_gateway_vpc_attachment" "production" {
  subnet_ids         = module.production_vpc.private_subnets
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = module.production_vpc.vpc_id
  dns_support        = "enable"

  tags = {
    Name = "tgw-attachment-production"
  }
}

# Attach Development VPC
resource "aws_ec2_transit_gateway_vpc_attachment" "development" {
  subnet_ids         = module.development_vpc.private_subnets
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = module.development_vpc.vpc_id
  dns_support        = "enable"

  tags = {
    Name = "tgw-attachment-development"
  }
}

# Route from Production VPC to Transit Gateway
resource "aws_route" "production_to_tgw" {
  count                  = length(module.production_vpc.private_route_table_ids)
  route_table_id         = module.production_vpc.private_route_table_ids[count.index]
  destination_cidr_block = "10.0.0.0/8" # All internal traffic
  transit_gateway_id     = aws_ec2_transit_gateway.main.id
}
When to Use Transit Gateway
- Multiple VPCs - 3+ VPCs that need to communicate with each other
- Hybrid cloud - On-premises connectivity via VPN or Direct Connect
- Centralized control - Need centralized routing and security inspection
- Cost consideration - Break-even vs VPC Peering at ~3+ VPCs
Direct Connect: Dedicated Network Connection
Direct Connect is AWS’s solution for dedicated, private connectivity between your data center and AWS. Unlike VPN connections that go over the public internet, Direct Connect uses a dedicated fiber connection, providing predictable performance and lower latency.
Why use Direct Connect instead of VPN:
- Higher bandwidth: 1 Gbps, 10 Gbps, or 100 Gbps vs VPN’s 1.25 Gbps per tunnel
- Consistent performance: Dedicated bandwidth with predictable latency
- Lower data transfer costs: $0.02/GB vs $0.09/GB for internet egress
- Better security: Traffic never touches the public internet
- Compliance: Some regulations require private connectivity
How it works:
- Order a Direct Connect port at an AWS Direct Connect location (colocation facility)
- Work with a network provider to establish physical connectivity
- Create a Virtual Interface (VIF) to connect to your VPC or Transit Gateway
- Configure BGP routing between your router and AWS
- Traffic flows over the dedicated connection
Setup time: Typically 2-4 weeks (requires physical circuit provisioning)
Common use cases:
- Data migration: Transfer large datasets to AWS (faster than internet upload)
- Hybrid applications: Low-latency connectivity for hybrid cloud workloads
- Disaster recovery: Reliable connection for replication and backup
- Production workloads: Predictable performance for mission-critical applications
Cost:
- Port hours: $0.30/hour for 1Gbps = $216/month
- Data transfer out: $0.02/GB (cheaper than internet)
Comparison: Direct Connect vs Site-to-Site VPN
Bandwidth:
- Direct Connect: 1 Gbps, 10 Gbps, or 100 Gbps dedicated
- Site-to-Site VPN: Up to 1.25 Gbps per tunnel (can use multiple tunnels)
Latency:
- Direct Connect: Low and consistent (private connection)
- Site-to-Site VPN: Higher and variable (depends on internet)
Cost:
- Direct Connect: $216/month (1 Gbps port) + $0.02/GB egress
- Site-to-Site VPN: $36/month (VPN connection) + $0.09/GB egress
Setup Time:
- Direct Connect: 2-4 weeks (physical circuit provisioning)
- Site-to-Site VPN: Minutes (fully self-service)
Reliability:
- Direct Connect: 99.9% SLA (recommend VPN backup)
- Site-to-Site VPN: 99.95% SLA (two tunnels for HA)
Best for:
- Direct Connect: Production workloads, large data transfers, consistent performance needs
- Site-to-Site VPN: Dev/test environments, backup connectivity, quick setup
Pro tip: Use both! Direct Connect for primary connectivity with VPN as backup.
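The cost figures in the comparison imply a traffic level above which Direct Connect is cheaper than VPN on AWS charges alone. The sketch below works that out from the numbers quoted above; it deliberately ignores the carrier or colocation fees for the physical circuit, which are not part of the AWS bill and can easily dominate. Local names are illustrative.

```terraform
locals {
  dx_port_monthly   = 216  # 1 Gbps Direct Connect port, USD/month
  vpn_monthly       = 36   # Site-to-Site VPN connection, USD/month
  dx_egress_per_gb  = 0.02 # USD/GB over Direct Connect
  vpn_egress_per_gb = 0.09 # USD/GB over the internet

  # Extra fixed cost of Direct Connect divided by the per-GB saving
  break_even_gb = (local.dx_port_monthly - local.vpn_monthly) / (local.vpn_egress_per_gb - local.dx_egress_per_gb)
}

output "direct_connect_break_even_gb" {
  value = format("%.0f", local.break_even_gb)
}
```

With these inputs the break-even is roughly 2.6 TB of outbound traffic per month. Below that, the case for Direct Connect rests on latency, consistency, and compliance rather than on data transfer savings.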
Implementation:
Direct Connect requires a Direct Connect Gateway to connect your on-premises network to multiple VPCs through a Transit Gateway. The configuration below shows how to set up the gateway association and a backup VPN connection. In production, you should always have a VPN backup because Direct Connect is a single physical connection that can fail. The VPN automatically takes over if the Direct Connect link goes down.
Direct Connect with VPN Backup terraform
# Direct Connect Gateway
resource "aws_dx_gateway" "main" {
  name            = "main-dx-gateway"
  amazon_side_asn = 64512
}

# Associate with Transit Gateway
resource "aws_dx_gateway_association" "main" {
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.main.id

  allowed_prefixes = [
    "10.0.0.0/8",
    "172.16.0.0/12"
  ]
}

# Customer gateway for the on-premises router (placeholder values, replace with your own)
resource "aws_customer_gateway" "main" {
  bgp_asn    = 65000          # placeholder: your on-premises BGP ASN
  ip_address = "203.0.113.10" # placeholder: public IP of your router
  type       = "ipsec.1"
}

# Site-to-Site VPN as backup
resource "aws_vpn_connection" "backup" {
  customer_gateway_id = aws_customer_gateway.main.id
  transit_gateway_id  = aws_ec2_transit_gateway.main.id
  type                = "ipsec.1"
  static_routes_only  = false

  tags = {
    Name = "backup-vpn"
  }
}
AWS Networking Cost Optimization
Monthly Cost Breakdown Example
Production VPC (us-east-1):
NAT Gateways (3 AZs):
- Hourly: 3 × $0.045 × 730 hours = $98.55
- Data processing: 500GB × $0.045 = $22.50
- Subtotal: $121.05
VPC Endpoints (Interface, one AZ each):
- ECR API: $7.30
- ECR DKR: $7.30
- CloudWatch Logs: $7.30
- Subtotal: $21.90
Load Balancers:
- Application Load Balancer: $16.20 + $8/LCU
- Subtotal: ~$25
Data Transfer:
- Inter-AZ: 200GB × $0.01 = $2.00
- Internet egress: 1TB × $0.09 = $90.00
- Subtotal: $92.00
Total Monthly Cost: ~$260
Cost Optimization Strategies
Add Gateway Endpoints (FREE)
Deploy S3 and DynamoDB Gateway Endpoints at no cost to eliminate NAT Gateway charges for AWS service traffic. Can save $0.045/GB in data processing fees.
Single NAT Gateway for Dev/Test
Use one NAT Gateway instead of three in non-production environments to save ~$64/month. Accept the reduced availability for cost savings.
Release Unused Resources
Delete unused Elastic IPs ($3.60/month each) and load balancers ($16-25/month each). Regular audits prevent waste.
VPC Peering Over Transit Gateway
For simple connectivity between 2-3 VPCs, use VPC Peering (no hourly charge) instead of Transit Gateway (~$36/month per attachment + data charges).
Interface Endpoints for High Traffic
Add Interface Endpoints (~$7/month per AZ) for services with sustained high traffic, roughly 200+ GB/month. Above that level the lower per-GB fee makes them cost-effective for ECR, CloudWatch, etc.
Monitor Data Transfer Patterns
Use VPC Flow Logs and Cost Explorer to identify and optimize expensive cross-AZ and internet data transfer patterns.
DNS and Service Discovery
Proper DNS configuration is essential for service-to-service communication. AWS provides several options depending on your architecture.
Route 53 Private Hosted Zones
Private hosted zones allow you to use custom DNS names within your VPC without exposing them to the public internet. This is useful for giving friendly names to internal services, databases, and other resources. You can associate a private hosted zone with multiple VPCs, allowing resources in different VPCs to resolve the same DNS names—useful for shared services architectures.
Route 53 Private Hosted Zone terraform
# Private hosted zone
resource "aws_route53_zone" "private" {
  name = "internal.mycompany.com"

  vpc {
    vpc_id = module.vpc.vpc_id
  }

  # Associate with additional VPCs if needed
  lifecycle {
    ignore_changes = [vpc]
  }

  tags = {
    Name = "internal-dns"
  }
}

# DNS record for internal service
resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "api.internal.mycompany.com"
  type    = "A"

  alias {
    name                   = aws_lb.internal.dns_name
    zone_id                = aws_lb.internal.zone_id
    evaluate_target_health = true
  }
}

# DNS record for database
resource "aws_route53_record" "db" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "db.internal.mycompany.com"
  type    = "CNAME"
  ttl     = 300
  records = [aws_db_instance.main.address]
}
Cross-VPC DNS Resolution
When using Transit Gateway or VPC Peering, resources in one VPC often need to resolve private DNS names from another VPC. For private hosted zone records, associate the zone with every VPC that needs to resolve it; without that association, queries for those records fail from the other VPC. On a peering connection, also enable DNS resolution so that the public DNS hostnames of instances in the peer VPC resolve to their private IP addresses.
Cross-VPC DNS Resolution terraform
# Enable DNS resolution for VPC peering
# Note: for cross-account or cross-region peering, apply each resource with that
# side's provider; in a single account and region, both blocks can share one resource.
resource "aws_vpc_peering_connection_options" "requester" {
  vpc_peering_connection_id = aws_vpc_peering_connection.main.id

  requester {
    allow_remote_vpc_dns_resolution = true
  }
}

resource "aws_vpc_peering_connection_options" "accepter" {
  vpc_peering_connection_id = aws_vpc_peering_connection.main.id

  accepter {
    allow_remote_vpc_dns_resolution = true
  }
}

# Associate private hosted zone with peered VPC
resource "aws_route53_zone_association" "peer" {
  zone_id = aws_route53_zone.private.zone_id
  vpc_id  = module.peer_vpc.vpc_id
}

Common AWS Networking Issues
Issue 1: Can’t SSH to EC2 in Private Subnet
Solution: Use AWS Systems Manager Session Manager (no SSH port needed!)
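Session Manager provides secure shell access to your EC2 instances without opening inbound ports, managing SSH keys, or using bastion hosts. It works through the SSM Agent, which is pre-installed on Amazon Linux and can be installed on other operating systems. Instances in a private subnet still need an outbound path to the Systems Manager endpoints, either through a NAT Gateway or through interface VPC endpoints. Session activity can be logged to CloudWatch Logs and S3 for auditing once logging is enabled in the Session Manager preferences.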
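If the private subnets have no NAT Gateway, the agent needs interface endpoints to reach Systems Manager. The sketch below shows one way to add them, assuming the `module.vpc` outputs from the Quick Start and the `us-east-1` region; the resource names are hypothetical.

```terraform
# Sketch only: resource names and the region are assumptions for illustration.
resource "aws_security_group" "ssm_endpoints" {
  name_prefix = "ssm-endpoints-"
  vpc_id      = module.vpc.vpc_id

  # The SSM Agent talks to the endpoints over HTTPS
  ingress {
    description = "HTTPS from inside the VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }
}

# Session Manager relies on these three services
resource "aws_vpc_endpoint" "ssm" {
  for_each = toset(["ssm", "ssmmessages", "ec2messages"])

  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.us-east-1.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.ssm_endpoints.id]
  private_dns_enabled = true
}
```

With a NAT Gateway already in place these endpoints are optional; they mainly matter for fully isolated subnets or when you want Systems Manager traffic to stay off the NAT path.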
SSM Session Manager IAM Configuration terraform
# IAM role for EC2 instances
resource "aws_iam_role" "ec2_ssm" {
  name = "ec2-ssm-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

# Attach SSM policy
resource "aws_iam_role_policy_attachment" "ec2_ssm" {
  role       = aws_iam_role.ec2_ssm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Instance profile
resource "aws_iam_instance_profile" "ec2_ssm" {
  name = "ec2-ssm-profile"
  role = aws_iam_role.ec2_ssm.name
}

Connect via CLI:

aws ssm start-session --target i-1234567890abcdef0

Issue 2: High NAT Gateway Costs
Diagnosis:
High NAT Gateway costs usually come from unexpected traffic patterns—often AWS service calls that could go through VPC endpoints instead. Use VPC Flow Logs to identify the top traffic sources and destinations, then add appropriate VPC endpoints for AWS services or optimize application behavior for external API calls.
AWS CLI & SQL: Diagnosing NAT Gateway Traffic bash
# Check VPC Flow Logs for traffic patterns
aws ec2 describe-flow-logs --filter "Name=resource-id,Values=vpc-xxxxx"

# Analyze with Athena
SELECT srcaddr, dstaddr, SUM(bytes) as total_bytes
FROM vpc_flow_logs
WHERE action = 'ACCEPT'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20;

Solution: Add VPC Endpoints for AWS services
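When the top destinations turn out to be S3 or DynamoDB, gateway endpoints are the usual first step because they carry no hourly or per-GB charge. A minimal sketch, assuming the `module.vpc` outputs from the Quick Start, the `us-east-1` region, and that the VPC does not already define these endpoints:

```terraform
# Sketch only: skip this if your VPC module already creates gateway endpoints.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"

  # Adds a route to S3 in every private route table, bypassing the NAT Gateway
  route_table_ids = module.vpc.private_route_table_ids
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.us-east-1.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids
}
```

After applying, rerunning the flow log query should show the S3 and DynamoDB traffic dropping off the NAT Gateway path.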
Issue 3: VPC Peering Not Working
Checklist:
VPC peering issues usually fall into one of four categories: the peering connection isn’t active, route tables aren’t configured correctly, security groups don’t allow traffic from the peer VPC CIDR, or the VPCs have overlapping CIDR blocks. Work through this checklist systematically to identify the root cause.
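Missing routes and security group rules are the two items on this list that are typically fixed in Terraform. The following sketch shows what a complete setup might look like, assuming both VPCs come from the same VPC module and reusing `aws_vpc_peering_connection.main` from the DNS example; `aws_security_group.app` is a hypothetical application security group.

```terraform
# Routes from this VPC's private subnets to the peer VPC
resource "aws_route" "to_peer" {
  count = length(module.vpc.private_route_table_ids)

  route_table_id            = module.vpc.private_route_table_ids[count.index]
  destination_cidr_block    = module.peer_vpc.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.main.id
}

# Return routes from the peer VPC; peering fails silently without both directions
resource "aws_route" "from_peer" {
  count = length(module.peer_vpc.private_route_table_ids)

  route_table_id            = module.peer_vpc.private_route_table_ids[count.index]
  destination_cidr_block    = module.vpc.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.main.id
}

# Allow HTTPS from the peer VPC CIDR (security group name is hypothetical)
resource "aws_security_group_rule" "allow_peer_https" {
  type              = "ingress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = [module.peer_vpc.vpc_cidr_block]
  security_group_id = aws_security_group.app.id
}
```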
AWS CLI: VPC Peering Troubleshooting Commands bash
# 1. Check peering connection status
aws ec2 describe-vpc-peering-connections \
  --filters "Name=status-code,Values=active"

# 2. Verify route tables in BOTH VPCs
aws ec2 describe-route-tables \
  --filters "Name=route.destination-cidr-block,Values=10.1.0.0/16"

# 3. Check security groups allow peer VPC CIDR
aws ec2 describe-security-groups --group-ids sg-xxxxx

# 4. Verify no overlapping CIDR blocks
aws ec2 describe-vpcs \
  --query "Vpcs[].[VpcId,CidrBlock]" --output table

Troubleshooting Tools
VPC Reachability Analyzer
VPC Reachability Analyzer is a powerful diagnostic tool that tests connectivity between resources without sending actual network traffic. It analyzes your VPC configuration—route tables, security groups, NACLs, and more—to determine if a path exists between a source and destination. This is invaluable for troubleshooting connectivity issues before or after deployment.
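A path can also be defined in Terraform so the check runs as part of a deployment. This is a sketch under assumptions: `aws_instance.source` and `aws_instance.dest` are hypothetical instances, and the resource names are illustrative.

```terraform
# Defines what to test: TCP 443 from one instance to another
resource "aws_ec2_network_insights_path" "app_to_dest" {
  source           = aws_instance.source.id # hypothetical source instance
  destination      = aws_instance.dest.id   # hypothetical destination instance
  protocol         = "tcp"
  destination_port = 443
}

# Runs the analysis against the current VPC configuration
resource "aws_ec2_network_insights_analysis" "app_to_dest" {
  network_insights_path_id = aws_ec2_network_insights_path.app_to_dest.id
  wait_for_completion      = true
}

# true when the configuration allows the connection
output "app_to_dest_reachable" {
  value = aws_ec2_network_insights_analysis.app_to_dest.path_found
}
```

Because the analysis reads configuration rather than sending packets, it reports what routing and security rules allow, not whether the application on the destination is actually listening.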
AWS CLI: VPC Reachability Analyzer bash
# Create the path to analyze
aws ec2 create-network-insights-path \
  --source i-source-instance \
  --destination i-dest-instance \
  --protocol tcp \
  --destination-port 443

# Start analysis
aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-xxxxx

# Get results
aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids nia-xxxxx

VPC Flow Logs Analysis
VPC Flow Logs combined with Athena provide powerful network traffic analysis capabilities. You can identify top talkers to understand bandwidth usage, find rejected connections to diagnose security group issues, and track traffic patterns over time. The queries below are starting points—customize them based on your specific troubleshooting needs.
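Athena can only query flow logs that are delivered to S3. If that is not set up yet, a minimal sketch might look like this, assuming the `module.vpc` output from the Quick Start; the bucket and resource names are hypothetical, and the `vpc_flow_logs` Athena table still has to be created separately over the bucket.

```terraform
# Bucket that receives the flow log files (name prefix is an assumption)
resource "aws_s3_bucket" "flow_logs" {
  bucket_prefix = "vpc-flow-logs-"
}

# Capture accepted and rejected traffic for the whole VPC
resource "aws_flow_log" "vpc" {
  vpc_id               = module.vpc.vpc_id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = aws_s3_bucket.flow_logs.arn
}
```

Setting `traffic_type` to `"ALL"` matters for the queries in this section, since the rejected-connections query returns nothing if only accepted traffic is logged.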
SQL: Athena Queries for VPC Flow Log Analysis sql
-- Top talkers
-- If the date partition column is typed as DATE, compare against DATE '2024-06-23'
SELECT srcaddr, dstaddr,
       SUM(bytes) as total_bytes,
       SUM(packets) as packet_count
FROM vpc_flow_logs
WHERE date = '2024-06-23'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20;

-- Rejected connections (security issues)
SELECT srcaddr, dstaddr, srcport, dstport,
       COUNT(*) as reject_count
FROM vpc_flow_logs
WHERE action = 'REJECT'
GROUP BY srcaddr, dstaddr, srcport, dstport
ORDER BY reject_count DESC;

Next Steps
You now have the knowledge to build production-ready networks on AWS! For deploying specific AWS services, check out our companion guide:
Continue learning
- Part 1a: AWS Service Networking Guide → - EKS, ECS, Lambda, RDS, and more
- ← Back to Series Overview
- Part 2: Azure Networking Best Practices →
- Part 3: GCP Networking Best Practices →
Need help? Contact Quabyt for AWS networking architecture and implementation support.