AWS Networking Best Practices: VPC, Transit Gateway, and Beyond
Master AWS networking with this comprehensive guide. Learn VPC design, security groups, Transit Gateway, Direct Connect, and cost optimization strategies with production-ready examples.
AWS Networking: Production-Ready Guide
This is Part 1 of our Cloud Networking series. If you haven’t read the overview, start with Cloud Networking Done Right: Series Overview.
Quick Start: Deploy Your First AWS VPC
Get a production-ready VPC running in 5 minutes:
# Save as main.tf and run: terraform init && terraform apply
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "my-production-vpc"
  cidr = "10.0.0.0/16" # 65,536 IP addresses

  # Deploy across 3 AZs for high availability
  azs = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Private subnets: For application servers, no direct internet access
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]

  # Public subnets: For load balancers, NAT gateways
  public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  # Database subnets: Isolated tier for RDS
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  # NAT Gateway configuration - High availability
  enable_nat_gateway     = true
  single_nat_gateway     = false # Set true for dev/test to save ~$64/month
  one_nat_gateway_per_az = true

  # Enable DNS
  enable_dns_hostnames = true
  enable_dns_support   = true

  # FREE Gateway Endpoints for S3 and DynamoDB save $0.045/GB of NAT Gateway
  # processing. The enable_s3_endpoint / enable_dynamodb_endpoint inputs are not
  # accepted by v5 of this module, so define the endpoints as aws_vpc_endpoint
  # resources (see "Gateway Endpoints" later in this guide).

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# Output for use in other modules
output "vpc_id" {
  value = module.vpc.vpc_id
}

output "private_subnet_ids" {
  value = module.vpc.private_subnets
}
Estimated Monthly Cost: ~$99-150 (3 NAT Gateways at ~$33 each, plus data processing)
AWS VPC Architecture
Before diving into individual components, let’s understand how they work together in a typical production VPC. This architecture shows a highly available, multi-tier application deployed across two Availability Zones.
What you’re seeing:
- Public subnets host internet-facing resources (load balancers, NAT Gateways)
- Private subnets host application servers with no direct internet access
- Database subnets provide an additional isolation layer for sensitive data
- Multiple AZs ensure high availability—if one AZ fails, the other continues serving traffic
Understanding AWS VPC Components
1. VPC (Virtual Private Cloud)
What it is: Your isolated network in AWS where you launch resources.
Key Characteristics:
- Regional resource (doesn’t span regions)
- Requires CIDR block (e.g., 10.0.0.0/16)
- Can have up to 5 CIDR blocks
- Supports IPv4 and IPv6
VPC Best Practices
- Use /16 for production VPCs - Provides 65,536 IP addresses for growth
- Private IP ranges only - Use 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16
- Plan for growth - Running out of IPs requires complex migration
- Document IP allocation - Maintain clear records to avoid overlaps
2. Subnets
What they are: Segments of your VPC CIDR block, confined to a single Availability Zone.
Types:
- Public Subnet: Has route to Internet Gateway, resources can have public IPs
- Private Subnet: No direct internet access, uses NAT Gateway for outbound
- Database Subnet: Isolated subnet for databases, no internet access
Subnet Best Practices
- Multi-AZ deployment - Create subnets across multiple Availability Zones for high availability
- Use /24 for most subnets - Provides 256 addresses, of which 251 are usable (AWS reserves 5 per subnet)
- Clear naming convention - Use descriptive names like prod-public-us-east-1a, prod-private-us-east-1a
- Reserve ranges - Keep subnet ranges available for future expansion
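The subnet layout described above can be derived from the VPC CIDR instead of typed by hand. This is a minimal sketch using Terraform's built-in `cidrsubnet` function; the local names, the `prod` prefix, and the tier offsets (1, 101, 201) are assumptions chosen to match the Quick Start layout, and the block needs no provider.

```hcl
locals {
  vpc_cidr = "10.0.0.0/16"
  azs      = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Assumed tier offsets: the third octet where each tier's /24 ranges start.
  tier_offsets = {
    private  = 1
    public   = 101
    database = 201
  }

  # cidrsubnet(prefix, newbits, netnum): 8 extra bits turn the /16 into /24s.
  subnets = merge([
    for tier, offset in local.tier_offsets : {
      for i, az in local.azs :
      "prod-${tier}-${az}" => cidrsubnet(local.vpc_cidr, 8, offset + i)
    }
  ]...)
}

output "subnet_plan" {
  value = local.subnets
}
```

With these inputs `prod-private-us-east-1a` maps to `10.0.1.0/24`. Every third-octet value not used by the plan stays free for future expansion, and the output doubles as a record of the allocation.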
3. Internet Gateway (IGW)
The Internet Gateway is your VPC’s connection to the public internet. It’s a simple concept but critical to understand: without an IGW, your VPC is completely isolated from the internet.
How it works:
- Attached to your VPC (one IGW per VPC)
- Performs NAT for instances with public IP addresses
- Horizontally scaled, redundant, and highly available by AWS
- No bandwidth constraints or throughput limits
Key Points:
- Completely free—no hourly charges or data processing fees
- Only works for resources with public IP addresses
- Requires a route in your route table pointing 0.0.0.0/0 to the IGW
When to use:
- Public-facing resources like Application Load Balancers
- Bastion hosts that need direct internet access
- Any resource that needs to be reachable from the internet
4. NAT Gateway
NAT Gateway solves a common problem: your private instances need to download updates, call external APIs, or access AWS services, but you don’t want to give them public IPs. NAT Gateway provides outbound internet connectivity while keeping instances completely private.
How it works:
- Deployed in a public subnet (needs internet access itself)
- Private instances route their outbound traffic through the NAT Gateway
- NAT Gateway translates private IPs to its public IP
- Return traffic is automatically routed back to the originating instance
- Important: Only works for outbound traffic—inbound connections are blocked
Cost: $0.045/hour (~$33/month) per NAT Gateway + $0.045/GB data processed
Best Practices:
- High availability: Deploy one NAT Gateway per AZ (prevents single point of failure)
- Cost optimization: Use single NAT Gateway in dev/test environments
- Avoid NAT charges: Use VPC Endpoints for AWS services (S3, DynamoDB, etc.)
- Monitor costs: Data processing fees can add up—review VPC Flow Logs to identify high-traffic sources
Cost Optimization:
# Development/Test: Single NAT Gateway
module "vpc_dev" {
  source = "terraform-aws-modules/vpc/aws"

  enable_nat_gateway = true
  single_nat_gateway = true # Saves $64/month (2 NAT Gateways)
}

# Production: One NAT Gateway per AZ
module "vpc_prod" {
  source = "terraform-aws-modules/vpc/aws"

  enable_nat_gateway     = true
  one_nat_gateway_per_az = true # High availability
}
5. Route Tables
Route tables are like GPS for your VPC traffic—they determine where network packets go. Every subnet must be associated with a route table, and that route table’s rules determine how traffic flows.
How they work:
- Each route has a destination (CIDR block) and a target (where to send traffic)
- Routes are evaluated from most specific to least specific
- Local routes (within VPC) are automatically added and can’t be deleted
- Each subnet can only be associated with one route table
Key Concepts:
- Main route table: Created automatically with your VPC, used by default for any subnet without explicit association
- Custom route tables: Create these for specific routing needs (public vs private subnets)
- Best practice: Don’t use the main route table for production subnets—create explicit route tables
Example Route Table (Private Subnet):
Destination      Target
10.0.0.0/16      local (VPC)
0.0.0.0/0        nat-xxxxx (NAT Gateway)
Example Route Table (Public Subnet):
Destination      Target
10.0.0.0/16      local (VPC)
0.0.0.0/0        igw-xxxxx (Internet Gateway)
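The two example tables could be expressed in Terraform roughly as follows. This is a standalone sketch, not what the VPC module does internally: all five input variables are assumptions standing in for resources you already have, and the resource names are illustrative.

```hcl
# Assumed inputs: IDs of existing resources.
variable "vpc_id" { type = string }
variable "internet_gateway_id" { type = string }
variable "nat_gateway_id" { type = string }
variable "public_subnet_id" { type = string }
variable "private_subnet_id" { type = string }

# Explicit route tables, so neither subnet falls back to the main route table.
# AWS adds the local route for the VPC CIDR to each table automatically.
resource "aws_route_table" "public" {
  vpc_id = var.vpc_id
}

resource "aws_route_table" "private" {
  vpc_id = var.vpc_id
}

# Public subnet: default route to the Internet Gateway.
resource "aws_route" "public_default" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = var.internet_gateway_id
}

# Private subnet: default route to the NAT Gateway (outbound only).
resource "aws_route" "private_default" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = var.nat_gateway_id
}

resource "aws_route_table_association" "public" {
  subnet_id      = var.public_subnet_id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  subnet_id      = var.private_subnet_id
  route_table_id = aws_route_table.private.id
}
```

The only difference between a public and a private subnet here is the target of the `0.0.0.0/0` route.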
AWS Security: Defense in Depth
AWS networking security follows a “defense in depth” strategy—multiple layers of security controls that work together. If one layer is breached, others provide backup protection. Think of it like a castle with multiple walls, moats, and gates.
The security layers:
Security Groups (Most Important!)
Security Groups are your most important security control in AWS. They act as virtual firewalls for your EC2 instances, controlling both inbound and outbound traffic. Understanding security groups is essential for any AWS deployment.
What makes them special:
- Stateful: If you allow inbound traffic on port 443, the response traffic is automatically allowed back out—you don’t need a separate outbound rule
- Deny by default: All inbound traffic is blocked unless you explicitly allow it. All outbound traffic is allowed by default
- Instance-level: Applied to ENIs (Elastic Network Interfaces), not subnets
- Dynamic: Changes take effect immediately—no need to restart instances
- Security group references: You can allow traffic from another security group instead of IP ranges (powerful for microservices)
Common mistake: New users often confuse security groups with NACLs. Security groups are stateful and deny by default; NACLs are stateless and require explicit allow/deny rules.
Production Example:
# Web tier security group
resource "aws_security_group" "web_tier" {
  name        = "web-tier-sg"
  description = "Security group for web tier"
  vpc_id      = module.vpc.vpc_id

  # Allow HTTPS from anywhere
  ingress {
    description = "HTTPS from internet"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Allow HTTP (redirect to HTTPS at ALB)
  ingress {
    description = "HTTP from internet"
    from_port   = 80
    to_port     = 80
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  # Outbound to app tier only. The destination is the private subnet CIDRs:
  # referencing aws_security_group.app_tier here while app_tier's ingress
  # references web_tier would create a Terraform dependency cycle.
  egress {
    description = "To app tier"
    from_port   = 8080
    to_port     = 8080
    protocol    = "tcp"
    cidr_blocks = module.vpc.private_subnets_cidr_blocks
  }

  tags = {
    Name = "web-tier-sg"
    Tier = "web"
  }
}

# Application tier security group
resource "aws_security_group" "app_tier" {
  name        = "app-tier-sg"
  description = "Security group for application tier"
  vpc_id      = module.vpc.vpc_id

  # Only allow traffic from web tier
  ingress {
    description     = "From web tier"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.web_tier.id]
  }

  # Outbound to database tier only (subnet CIDRs, again to avoid a cycle)
  egress {
    description = "To database tier"
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = module.vpc.database_subnets_cidr_blocks
  }

  # Outbound HTTPS for API calls
  egress {
    description = "HTTPS for external APIs"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }

  tags = {
    Name = "app-tier-sg"
    Tier = "application"
  }
}

# Database tier security group
resource "aws_security_group" "db_tier" {
  name        = "db-tier-sg"
  description = "Security group for database tier"
  vpc_id      = module.vpc.vpc_id

  # Only allow traffic from app tier
  ingress {
    description     = "PostgreSQL from app tier"
    from_port       = 5432
    to_port         = 5432
    protocol        = "tcp"
    security_groups = [aws_security_group.app_tier.id]
  }

  # No outbound internet access: with no egress block, Terraform removes
  # the default allow-all egress rule.
  # (Add specific egress rules if needed)

  tags = {
    Name = "db-tier-sg"
    Tier = "database"
  }
}
Network ACLs (NACLs)
Network ACLs are often misunderstood and overused. They’re stateless firewalls that operate at the subnet level, providing an additional layer of security beyond Security Groups. However, for most use cases, Security Groups alone are sufficient.
Key difference from Security Groups:
- Stateless: Unlike Security Groups, NACLs don’t track connection state. If you allow inbound traffic on port 443, you must also explicitly allow the return traffic on ephemeral ports (1024-65535)
- Subnet-level: Applied to entire subnets, not individual instances
- Rule evaluation: Rules are numbered and evaluated in order (lowest number first)
- Allow and Deny: Can explicitly deny traffic (Security Groups can only allow)
When to use NACLs:
- Block specific IPs: Deny traffic from known malicious IP ranges
- Compliance requirements: Some regulations require subnet-level controls
- Additional defense layer: Defense in depth strategy
- Temporary blocks: Quick way to block traffic to an entire subnet
Best Practice: Start with the default NACL (allows all traffic) and only add custom NACLs when you have a specific security requirement. Most applications don’t need them.
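For the cases where a custom NACL is warranted, the sketch below shows the two properties that most often trip people up: rule ordering and statelessness. It is a standalone illustration under assumptions; the variables, resource names, rule numbers, and the blocked range (a documentation-only CIDR) are all placeholders.

```hcl
# Assumed inputs: an existing VPC and the subnets the NACL should cover.
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }

resource "aws_network_acl" "app" {
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids
}

# Lowest rule number wins: the deny at 90 is evaluated before the allow at 100.
resource "aws_network_acl_rule" "deny_bad_range" {
  network_acl_id = aws_network_acl.app.id
  rule_number    = 90
  egress         = false
  protocol       = "-1" # all protocols
  rule_action    = "deny"
  cidr_block     = "198.51.100.0/24" # placeholder range to block
}

resource "aws_network_acl_rule" "allow_https_in" {
  network_acl_id = aws_network_acl.app.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 443
  to_port        = 443
}

# Stateless: replies to clients leave on ephemeral ports and need their own rule.
resource "aws_network_acl_rule" "allow_ephemeral_out" {
  network_acl_id = aws_network_acl.app.id
  rule_number    = 100
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 1024
  to_port        = 65535
}
```

A custom NACL denies everything it does not explicitly allow, so this sketch permits only inbound HTTPS and its replies. A real subnet would need further rules, for example for outbound connections the instances initiate.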
VPC Flow Logs
VPC Flow Logs are your network traffic recorder—they capture information about IP traffic going to and from network interfaces in your VPC. Think of them as your VPC’s black box, essential for troubleshooting, security analysis, and cost optimization.
What they capture:
- Source and destination IP addresses
- Source and destination ports
- Protocol (TCP, UDP, ICMP)
- Number of packets and bytes
- Action taken (ACCEPT or REJECT)
- Timestamps
Why you need them:
- Troubleshooting: Diagnose why connections are failing (security group rules, routing issues)
- Security analysis: Detect unusual traffic patterns, potential attacks, or data exfiltration
- Cost optimization: Identify which resources are generating high data transfer costs
- Compliance: Many regulations require network traffic logging
- Forensics: Investigate security incidents after they occur
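Cost consideration: Flow Logs are charged based on the amount of data ingested (~$0.50 per GB when delivered to CloudWatch Logs). For high-traffic VPCs, this can add up. Flow Logs cannot sample traffic, so to reduce volume, scope them to specific subnets or network interfaces, capture only REJECT traffic, or deliver to S3.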
Implementation:
# VPC Flow Logs to CloudWatch
resource "aws_flow_log" "vpc_flow_logs" {
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  log_destination_type = "cloud-watch-logs"
  traffic_type         = "ALL" # or "ACCEPT" or "REJECT"
  vpc_id               = module.vpc.vpc_id
  iam_role_arn         = aws_iam_role.flow_logs.arn

  tags = {
    Name = "vpc-flow-logs"
  }
}

resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/aws/vpc-flow-log/main-vpc"
  retention_in_days = 90 # Adjust based on compliance needs

  tags = {
    Name = "vpc-flow-logs"
  }
}

# VPC Flow Logs to S3 (cheaper for long-term storage)
resource "aws_flow_log" "vpc_flow_logs_s3" {
  log_destination      = aws_s3_bucket.flow_logs.arn
  log_destination_type = "s3"
  traffic_type         = "ALL"
  vpc_id               = module.vpc.vpc_id

  # Parquet format for Athena queries
  destination_options {
    file_format        = "parquet"
    per_hour_partition = true
  }

  tags = {
    Name = "vpc-flow-logs-s3"
  }
}
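The CloudWatch example references `aws_iam_role.flow_logs` without defining it. One possible definition is sketched below; the role and policy names are assumptions. The permissions follow the commonly documented set for publishing flow logs to CloudWatch Logs and use a wildcard resource, which you may want to narrow.

```hcl
# Role that the VPC Flow Logs service assumes to write to CloudWatch Logs.
resource "aws_iam_role" "flow_logs" {
  name = "vpc-flow-logs-role" # assumed name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "vpc-flow-logs.amazonaws.com" }
    }]
  })
}

# Inline policy granting the log-publishing permissions.
resource "aws_iam_role_policy" "flow_logs" {
  name = "vpc-flow-logs-publish" # assumed name
  role = aws_iam_role.flow_logs.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams",
      ]
      Resource = "*" # broad for simplicity; tighten for production
    }]
  })
}
```

The S3 variant needs no IAM role, but it does assume an `aws_s3_bucket.flow_logs` resource defined elsewhere in your configuration.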
VPC Endpoints: Save Money on NAT Gateway
Here’s a common problem: Your EC2 instances in private subnets need to access AWS services like S3 or DynamoDB. Without VPC Endpoints, this traffic goes through your NAT Gateway, costing you $0.045/GB in data processing fees. For high-traffic applications, this can mean hundreds of dollars per month in unnecessary costs.
The solution: VPC Endpoints provide private connectivity to AWS services without going through the NAT Gateway or internet. Traffic stays on AWS’s private network, improving security and reducing costs.
Two types of VPC Endpoints:
- Gateway Endpoints (FREE): For S3 and DynamoDB only
- Interface Endpoints (~$7/month per AZ + data): For most other AWS services
Cost savings example: If your application transfers 1TB/month to S3 through NAT Gateway:
- Without VPC Endpoint: 1000 GB × $0.045 = $45/month
- With Gateway Endpoint: $0/month
That’s $540/year saved per TB of S3 traffic!
Gateway Endpoints (FREE!)
For S3 and DynamoDB only:
# S3 Gateway Endpoint - FREE, no hourly charges
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"

  # Associate with private subnet route tables
  route_table_ids = module.vpc.private_route_table_ids

  tags = {
    Name = "s3-gateway-endpoint"
  }
}

# DynamoDB Gateway Endpoint - FREE
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.region}.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids

  tags = {
    Name = "dynamodb-gateway-endpoint"
  }
}
Savings: $0.045/GB that would go through NAT Gateway
Interface Endpoints (Paid)
For other AWS services:
Cost: $0.01/hour per endpoint per AZ (~$7/month for each AZ the endpoint is deployed in) + $0.01/GB data processed
# ECR API Endpoint
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "ecr-api-endpoint"
  }
}

# ECR Docker Endpoint
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "ecr-dkr-endpoint"
  }
}

# Security group for VPC endpoints
resource "aws_security_group" "vpc_endpoints" {
  name        = "vpc-endpoints-sg"
  description = "Security group for VPC endpoints"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "HTTPS from VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }

  tags = {
    Name = "vpc-endpoints-sg"
  }
}
When to use Interface Endpoints:
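- High traffic to AWS services (roughly >200 GB/month for each AZ the endpoint is deployed in)
- Break-even point: ~$7/month per AZ for the endpoint vs $0.035/GB saved ($0.045 NAT Gateway processing minus $0.01 endpoint processing)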
- Services: ECR, CloudWatch Logs, Systems Manager, Secrets Manager
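The break-even traffic volume can be worked out from the prices quoted in this guide. The sketch below does the arithmetic in Terraform locals and needs no provider; the local names and the three-AZ deployment are assumptions, and the prices are the us-east-1 figures used throughout this article.

```hcl
locals {
  hours_per_month     = 730
  endpoint_hourly_usd = 0.01  # per endpoint, per AZ
  endpoint_per_gb_usd = 0.01  # endpoint data processing
  nat_per_gb_usd      = 0.045 # NAT Gateway data processing
  az_count            = 3     # assumption: endpoint deployed in three subnets

  # Fixed monthly cost of one interface endpoint across all its AZs.
  endpoint_fixed_usd = local.endpoint_hourly_usd * local.hours_per_month * local.az_count

  # Per-GB saving when traffic uses the endpoint instead of the NAT Gateway.
  saving_per_gb_usd = local.nat_per_gb_usd - local.endpoint_per_gb_usd

  # Monthly GB above which the endpoint is cheaper than NAT for that traffic.
  break_even_gb = local.endpoint_fixed_usd / local.saving_per_gb_usd
}

output "interface_endpoint_break_even_gb" {
  value = local.break_even_gb
}
```

With `az_count = 1` the result is about 209 GB/month; with three AZs it is about 626 GB/month. The comparison assumes the NAT Gateway stays in place either way, so only its per-GB charge is avoided.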
Transit Gateway: Hub-and-Spoke Architecture
As your AWS environment grows, connecting multiple VPCs becomes complex. VPC Peering works for 2-3 VPCs, but a full mesh needs N×(N-1)/2 connections: 5 VPCs already require 10 peering connections, and 10 VPCs require 45. Transit Gateway solves this by acting as a central hub. Each VPC connects once to the Transit Gateway, and the Transit Gateway handles routing between all VPCs.
Why use Transit Gateway:
- Simplified connectivity: Connect 10 VPCs with 10 attachments instead of 45 peering connections
- Centralized routing: Manage all inter-VPC routing in one place
- On-premises integration: Connect your data center once, reach all VPCs
- Scalability: Supports thousands of VPCs per Transit Gateway
- Segmentation: Use route tables to control which VPCs can talk to each other
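The connection counts behind the first point follow directly from the full-mesh formula. A small provider-free sketch, where `vpc_count` is a hypothetical input:

```hcl
variable "vpc_count" {
  type    = number
  default = 10
}

locals {
  # Full mesh: every pair of VPCs needs its own peering connection.
  peering_connections = var.vpc_count * (var.vpc_count - 1) / 2

  # Hub and spoke: one Transit Gateway attachment per VPC.
  tgw_attachments = var.vpc_count
}

output "connectivity_comparison" {
  value = {
    peering_connections = local.peering_connections
    tgw_attachments     = local.tgw_attachments
  }
}
```

With the default of 10 VPCs this yields 45 peering connections versus 10 attachments.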
How it works:
- Create a Transit Gateway in your region
- Attach VPCs to the Transit Gateway (one attachment per VPC)
- Configure route tables on the Transit Gateway to control traffic flow
- Update VPC route tables to send inter-VPC traffic to the Transit Gateway
Cost: $0.05/hour per attachment (~$36/month for each attached VPC) + $0.02/GB data processed
Break-even point: VPC Peering has no hourly charge, so for 2-3 VPCs it is usually cheaper. Transit Gateway becomes worthwhile at roughly 3+ VPCs that all need to communicate, where one attachment per VPC replaces a growing mesh of peering connections.
Implementation:
# Transit Gateway
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Main Transit Gateway"
  default_route_table_association = "enable"
  default_route_table_propagation = "enable"
  dns_support                     = "enable"
  vpn_ecmp_support                = "enable"

  tags = {
    Name = "main-tgw"
  }
}

# Attach Production VPC
resource "aws_ec2_transit_gateway_vpc_attachment" "production" {
  subnet_ids         = module.production_vpc.private_subnets
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = module.production_vpc.vpc_id
  dns_support        = "enable"

  tags = {
    Name = "tgw-attachment-production"
  }
}

# Attach Development VPC
resource "aws_ec2_transit_gateway_vpc_attachment" "development" {
  subnet_ids         = module.development_vpc.private_subnets
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = module.development_vpc.vpc_id
  dns_support        = "enable"

  tags = {
    Name = "tgw-attachment-development"
  }
}

# Route from Production VPC to Transit Gateway
resource "aws_route" "production_to_tgw" {
  count                  = length(module.production_vpc.private_route_table_ids)
  route_table_id         = module.production_vpc.private_route_table_ids[count.index]
  destination_cidr_block = "10.0.0.0/8" # All internal traffic
  transit_gateway_id     = aws_ec2_transit_gateway.main.id
}
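The implementation above adds a route only on the production side. Traffic needs a path in both directions, so the development VPC would need a matching route. A sketch that reuses the names from the example above; the resource name `development_to_tgw` and the explicit `depends_on` are additions of this sketch:

```hcl
# Route from Development VPC to Transit Gateway (mirror of the production route)
resource "aws_route" "development_to_tgw" {
  count                  = length(module.development_vpc.private_route_table_ids)
  route_table_id         = module.development_vpc.private_route_table_ids[count.index]
  destination_cidr_block = "10.0.0.0/8" # All internal traffic
  transit_gateway_id     = aws_ec2_transit_gateway.main.id

  # A route to the Transit Gateway generally cannot be created
  # until the VPC has been attached to it.
  depends_on = [aws_ec2_transit_gateway_vpc_attachment.development]
}
```

The same `depends_on` pattern, pointing at the production attachment, is likely worth adding to the production route too.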
When to Use Transit Gateway
- Multiple VPCs - 3+ VPCs that need to communicate with each other
- Hybrid cloud - On-premises connectivity via VPN or Direct Connect
- Centralized control - Need centralized routing and security inspection
- Cost consideration - Break-even vs VPC Peering at ~3+ VPCs
Direct Connect: Dedicated Network Connection
Direct Connect is AWS’s solution for dedicated, private connectivity between your data center and AWS. Unlike VPN connections that go over the public internet, Direct Connect uses a dedicated fiber connection, providing predictable performance and lower latency.
Why use Direct Connect instead of VPN:
- Higher bandwidth: 1 Gbps, 10 Gbps, or 100 Gbps vs VPN’s 1.25 Gbps per tunnel
- Consistent performance: Dedicated bandwidth with predictable latency
- Lower data transfer costs: $0.02/GB vs $0.09/GB for internet egress
- Better security: Traffic never touches the public internet
- Compliance: Some regulations require private connectivity
How it works:
- Order a Direct Connect port at an AWS Direct Connect location (colocation facility)
- Work with a network provider to establish physical connectivity
- Create a Virtual Interface (VIF) to connect to your VPC or Transit Gateway
- Configure BGP routing between your router and AWS
- Traffic flows over the dedicated connection
Setup time: Typically 2-4 weeks (requires physical circuit provisioning)
Common use cases:
- Data migration: Transfer large datasets to AWS (faster than internet upload)
- Hybrid applications: Low-latency connectivity for hybrid cloud workloads
- Disaster recovery: Reliable connection for replication and backup
- Production workloads: Predictable performance for mission-critical applications
Cost:
- Port hours: $0.30/hour for 1Gbps = $216/month
- Data transfer out: $0.02/GB (cheaper than internet)
Comparison: Direct Connect vs Site-to-Site VPN
Bandwidth:
- Direct Connect: 1 Gbps, 10 Gbps, or 100 Gbps dedicated
- Site-to-Site VPN: Up to 1.25 Gbps per tunnel (can use multiple tunnels)
Latency:
- Direct Connect: Low and consistent (private connection)
- Site-to-Site VPN: Higher and variable (depends on internet)
Cost:
- Direct Connect: $216/month (1 Gbps port) + $0.02/GB egress
- Site-to-Site VPN: $36/month (VPN connection) + $0.09/GB egress
Setup Time:
- Direct Connect: 2-4 weeks (physical circuit provisioning)
- Site-to-Site VPN: Minutes (fully self-service)
Reliability:
- Direct Connect: 99.9% SLA (recommend VPN backup)
- Site-to-Site VPN: 99.95% SLA (two tunnels for HA)
Best for:
- Direct Connect: Production workloads, large data transfers, consistent performance needs
- Site-to-Site VPN: Dev/test environments, backup connectivity, quick setup
Pro tip: Use both! Direct Connect for primary connectivity with VPN as backup.
Implementation:
# Direct Connect Gateway
resource "aws_dx_gateway" "main" {
  name            = "main-dx-gateway"
  amazon_side_asn = 64512
}

# Associate with Transit Gateway
resource "aws_dx_gateway_association" "main" {
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.main.id

  allowed_prefixes = [
    "10.0.0.0/8",
    "172.16.0.0/12"
  ]
}

# Site-to-Site VPN as backup
resource "aws_vpn_connection" "backup" {
  customer_gateway_id = aws_customer_gateway.main.id
  transit_gateway_id  = aws_ec2_transit_gateway.main.id
  type                = "ipsec.1"
  static_routes_only  = false

  tags = {
    Name = "backup-vpn"
  }
}
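The backup VPN references `aws_customer_gateway.main`, which represents your on-premises router and is not defined above. A minimal sketch follows; the ASN, the IP address (taken from a documentation-only range), and the tag are placeholders to replace with your own values.

```hcl
# Customer gateway: AWS's record of the on-premises VPN device.
resource "aws_customer_gateway" "main" {
  bgp_asn    = 65000          # placeholder: private ASN of the on-premises router
  ip_address = "203.0.113.10" # placeholder: public IP of the on-premises router
  type       = "ipsec.1"

  tags = {
    Name = "on-prem-router" # assumed name
  }
}
```

Because the VPN connection sets `static_routes_only = false`, routes are exchanged over BGP, which is why the customer gateway needs an ASN.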
AWS Networking Cost Optimization
Monthly Cost Breakdown Example
Production VPC (us-east-1):
NAT Gateways (3 AZs):
- Hourly: 3 × $0.045 × 730 hours = $98.55
- Data processing: 500GB × $0.045 = $22.50
- Subtotal: $121.05

VPC Endpoints (Interface):
- ECR API: $7.30
- ECR DKR: $7.30
- CloudWatch Logs: $7.30
- Subtotal: $21.90

Load Balancers:
- Application Load Balancer: $16.20 + $8/LCU
- Subtotal: ~$25

Data Transfer:
- Inter-AZ: 200GB × $0.01 = $2.00
- Internet egress: 1TB × $0.09 = $90.00
- Subtotal: $92.00
Total Monthly Cost: ~$260
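The breakdown can be reproduced as plain arithmetic, which makes it easy to plug in your own traffic figures. A provider-free sketch using the numbers above; the local names are assumptions and the load balancer line takes the article's ~$25 estimate as given.

```hcl
locals {
  hours_per_month = 730

  # 3 NAT Gateways (hourly) + 500 GB of data processing
  nat_usd = 3 * 0.045 * local.hours_per_month + 500 * 0.045

  # 3 interface endpoints at $0.01/hour each
  endpoints_usd = 3 * 0.01 * local.hours_per_month

  # Application Load Balancer, using the estimate from the breakdown
  alb_usd = 25

  # 200 GB inter-AZ + 1 TB (1000 GB) internet egress
  transfer_usd = 200 * 0.01 + 1000 * 0.09

  total_usd = local.nat_usd + local.endpoints_usd + local.alb_usd + local.transfer_usd
}

output "estimated_monthly_cost_usd" {
  value = local.total_usd
}
```

The total comes to about $260, matching the figure above.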
Cost Optimization Strategies
Add Gateway Endpoints (FREE)
Deploy S3 and DynamoDB Gateway Endpoints at no cost to eliminate NAT Gateway charges for AWS service traffic. Can save $0.045/GB in data processing fees.
Single NAT Gateway for Dev/Test
Use one NAT Gateway instead of three in non-production environments to save ~$64/month. Accept the reduced availability for cost savings.
Release Unused Resources
Delete unused Elastic IPs ($3.60/month each) and load balancers ($16-25/month each). Regular audits prevent waste.
VPC Peering Over Transit Gateway
For simple connectivity between 2-3 VPCs, use VPC Peering (no hourly charge) instead of Transit Gateway (~$36/month per VPC attachment + data charges).
Interface Endpoints for High Traffic
Add Interface Endpoints (~$7/month per AZ) for services that would otherwise push more than roughly 200 GB/month per endpoint AZ through the NAT Gateway. Above that break-even point they are cost-effective for ECR, CloudWatch, etc.
Monitor Data Transfer Patterns
Use VPC Flow Logs and Cost Explorer to identify and optimize expensive cross-AZ and internet data transfer patterns.
Common AWS Networking Issues
Issue 1: Can’t SSH to EC2 in Private Subnet
Solution: Use AWS Systems Manager Session Manager (no SSH port needed!)
# IAM role for EC2 instances
resource "aws_iam_role" "ec2_ssm" {
  name = "ec2-ssm-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

# Attach SSM policy
resource "aws_iam_role_policy_attachment" "ec2_ssm" {
  role       = aws_iam_role.ec2_ssm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Instance profile
resource "aws_iam_instance_profile" "ec2_ssm" {
  name = "ec2-ssm-profile"
  role = aws_iam_role.ec2_ssm.name
}
Connect via CLI:
aws ssm start-session --target i-1234567890abcdef0
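Session Manager only works if the instance can reach the Systems Manager service. In a private subnet with no NAT Gateway path, that typically means adding interface endpoints for it. The sketch below reuses `module.vpc`, `var.region`, and `aws_security_group.vpc_endpoints` from earlier in this guide; the resource name `ssm` and the exact set of services are assumptions to verify against your setup.

```hcl
# Interface endpoints commonly used by Session Manager in isolated subnets.
resource "aws_vpc_endpoint" "ssm" {
  for_each = toset(["ssm", "ssmmessages", "ec2messages"])

  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "${each.key}-endpoint"
  }
}
```

Each of these is billed per AZ like any other interface endpoint, so if your private subnets already route through a NAT Gateway you may prefer to skip them.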
Issue 2: High NAT Gateway Costs
Diagnosis:
# Check VPC Flow Logs for traffic patterns
aws ec2 describe-flow-logs --filter "Name=resource-id,Values=vpc-xxxxx"

-- Analyze with Athena
SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE action = 'ACCEPT'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20;
Solution: Add VPC Endpoints for AWS services
Issue 3: VPC Peering Not Working
Checklist:
# 1. Check peering connection status
aws ec2 describe-vpc-peering-connections \
  --filters "Name=status-code,Values=active"

# 2. Verify route tables in BOTH VPCs
aws ec2 describe-route-tables \
  --filters "Name=route.destination-cidr-block,Values=10.1.0.0/16"

# 3. Check security groups allow peer VPC CIDR
aws ec2 describe-security-groups --group-ids sg-xxxxx

# 4. Verify no overlapping CIDR blocks
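Most peering failures come down to a missing route on one side. For reference, a working setup in Terraform might look like the sketch below. It is standalone and every variable is an assumed input; `auto_accept` only applies when both VPCs are in the same account and region.

```hcl
# Assumed inputs describing the two VPCs to peer.
variable "vpc_a_id" { type = string }
variable "vpc_b_id" { type = string }
variable "vpc_a_cidr" { type = string } # e.g. "10.0.0.0/16"
variable "vpc_b_cidr" { type = string } # e.g. "10.1.0.0/16", must not overlap
variable "vpc_a_route_table_id" { type = string }
variable "vpc_b_route_table_id" { type = string }

resource "aws_vpc_peering_connection" "a_to_b" {
  vpc_id      = var.vpc_a_id
  peer_vpc_id = var.vpc_b_id
  auto_accept = true # same account and region only
}

# Routes are required in BOTH VPCs; a one-sided route is a common cause of failure.
resource "aws_route" "a_to_b" {
  route_table_id            = var.vpc_a_route_table_id
  destination_cidr_block    = var.vpc_b_cidr
  vpc_peering_connection_id = aws_vpc_peering_connection.a_to_b.id
}

resource "aws_route" "b_to_a" {
  route_table_id            = var.vpc_b_route_table_id
  destination_cidr_block    = var.vpc_a_cidr
  vpc_peering_connection_id = aws_vpc_peering_connection.a_to_b.id
}
```

If each VPC has several route tables, every table whose subnets need to reach the peer requires its own route.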
Troubleshooting Tools
VPC Reachability Analyzer
Test connectivity without sending packets:
# Create analysis
aws ec2 create-network-insights-path \
  --source i-source-instance \
  --destination i-dest-instance \
  --protocol tcp \
  --destination-port 443

# Start analysis
aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-xxxxx

# Get results
aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids nia-xxxxx
VPC Flow Logs Analysis
Query with Athena:
-- Top talkers
-- COUNT(*) counts flow log records, not packets; use SUM(packets) for a packet total
SELECT srcaddr, dstaddr, SUM(bytes) AS total_bytes, COUNT(*) AS flow_count
FROM vpc_flow_logs
WHERE date = '2024-06-23'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20;

-- Rejected connections (security issues)
SELECT srcaddr, dstaddr, srcport, dstport, COUNT(*) AS reject_count
FROM vpc_flow_logs
WHERE action = 'REJECT'
GROUP BY srcaddr, dstaddr, srcport, dstport
ORDER BY reject_count DESC;
Next Steps
You now have the knowledge to build production-ready networks on AWS!
Continue the series
- ← Back to Series Overview
- Part 2: Azure Networking Best Practices →
- Part 3: GCP Networking Best Practices →
Need help? Contact Quabyt for AWS networking architecture and implementation support.