AWS Networking Best Practices: VPC, Transit Gateway, and Beyond

Master AWS networking with this comprehensive guide. Learn VPC design, security groups, Transit Gateway, Direct Connect, and cost optimization strategies with production-ready examples.

Jun 29, 2024
25 min read

AWS Networking: Production-Ready Guide

This is Part 1 of our Cloud Networking series. If you haven’t read the overview, start with Cloud Networking Done Right: Series Overview.

Quick Start: Deploy Your First AWS VPC

Want to get a production-ready VPC running quickly? This Terraform configuration gives you a solid foundation with all the essentials: multi-AZ deployment for high availability, properly segmented subnets for different workload tiers, NAT Gateways for secure outbound connectivity, and free VPC endpoints to reduce data transfer costs. You can customize the CIDR ranges and AZ selections based on your region and IP planning requirements.

Production VPC Module Configuration
terraform
# Save as main.tf and run: terraform init && terraform apply
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "my-production-vpc"
  cidr = "10.0.0.0/16" # 65,536 IP addresses

  # Deploy across 3 AZs for high availability
  azs = ["us-east-1a", "us-east-1b", "us-east-1c"]
  # Private subnets: For application servers, no direct internet access
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  # Public subnets: For load balancers, NAT gateways
  public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]
  # Database subnets: Isolated tier for RDS
  database_subnets = ["10.0.201.0/24", "10.0.202.0/24", "10.0.203.0/24"]

  # NAT Gateway configuration - High availability
  enable_nat_gateway     = true
  single_nat_gateway     = false # Set true for dev/test to save ~$64/month
  one_nat_gateway_per_az = true

  # Enable DNS
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

# FREE VPC Endpoints - Save on NAT Gateway costs
# The enable_s3_endpoint / enable_dynamodb_endpoint inputs were removed from the
# VPC module in v3.0, so the endpoints come from its vpc-endpoints submodule.
module "vpc_endpoints" {
  source  = "terraform-aws-modules/vpc/aws//modules/vpc-endpoints"
  version = "~> 5.0"

  vpc_id = module.vpc.vpc_id

  endpoints = {
    s3 = {
      service         = "s3" # Saves $0.045/GB for S3 traffic
      service_type    = "Gateway"
      route_table_ids = module.vpc.private_route_table_ids
    }
    dynamodb = {
      service         = "dynamodb" # Saves $0.045/GB for DynamoDB traffic
      service_type    = "Gateway"
      route_table_ids = module.vpc.private_route_table_ids
    }
  }
}

# Output for use in other modules
output "vpc_id" {
  value = module.vpc.vpc_id
}
output "private_subnet_ids" {
  value = module.vpc.private_subnets
}

Estimated Monthly Cost: $96-150 (3 NAT Gateways + data transfer)

AWS VPC Architecture

Before diving into individual components, let’s understand how they work together in a typical production VPC. This architecture shows a highly available, multi-tier application deployed across two Availability Zones.

What you’re seeing:

  • Public subnets host internet-facing resources (load balancers, NAT Gateways)
  • Private subnets host application servers with no direct internet access
  • Database subnets provide an additional isolation layer for sensitive data
  • Multiple AZs ensure high availability - if one AZ fails, the other continues serving traffic

Understanding AWS VPC Components

1. VPC (Virtual Private Cloud)

What it is: Your isolated network in AWS where you launch resources.

Key Characteristics:

  • Regional resource (doesn’t span regions)
  • Requires CIDR block (e.g., 10.0.0.0/16)
  • Default limit of 5 IPv4 CIDR blocks per VPC (can request increase)
  • Supports IPv4 and IPv6 (dual-stack)

VPC Best Practices

  • Use /16 for production VPCs - Provides 65,536 IP addresses for growth
  • Private IP ranges only - Use 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16
  • Plan for growth - Running out of IPs requires complex migration
  • Document IP allocation - Maintain clear records to avoid overlaps

2. Subnets

What they are: Segments of your VPC CIDR block, confined to a single Availability Zone.

Types:

  • Public Subnet: Has route to Internet Gateway, resources can have public IPs
  • Private Subnet: No direct internet access, uses NAT Gateway for outbound
  • Database Subnet: Isolated subnet for databases, no internet access

Subnet Best Practices

  • Multi-AZ deployment - Create subnets across multiple Availability Zones for high availability
  • Use /24 for most subnets - Provides 256 addresses, 251 of them usable (AWS reserves 5 per subnet)
  • Clear naming convention - Use descriptive names like prod-public-us-east-1a, prod-private-us-east-1a
  • Reserve ranges - Keep subnet ranges available for future expansion
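
Subnet ranges like the ones in the Quick Start can be derived from the VPC CIDR instead of typed by hand, which makes overlaps harder to introduce. The sketch below uses Terraform's built-in cidrsubnet function; the local names (vpc_cidr, private_subnets, and so on) are illustrative and not part of the VPC module.

```terraform
locals {
  vpc_cidr = "10.0.0.0/16"

  # cidrsubnet(prefix, newbits, netnum): adding 8 bits to a /16 yields /24 blocks,
  # and netnum selects which /24 (1 -> 10.0.1.0/24, 101 -> 10.0.101.0/24, ...).
  private_subnets  = [for i in range(3) : cidrsubnet(local.vpc_cidr, 8, i + 1)]
  public_subnets   = [for i in range(3) : cidrsubnet(local.vpc_cidr, 8, i + 101)]
  database_subnets = [for i in range(3) : cidrsubnet(local.vpc_cidr, 8, i + 201)]
}

output "private_subnets" {
  value = local.private_subnets # 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24
}
```

The gaps between the three groups (10.0.4.0/24 through 10.0.100.0/24, for example) stay free for future tiers, which is one way to follow the "reserve ranges" advice above.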

3. Internet Gateway (IGW)

The Internet Gateway is your VPC’s connection to the public internet. It’s a simple concept but critical to understand: without an IGW, your VPC is completely isolated from the internet.

How it works:

  • Attached to your VPC (one IGW per VPC)
  • Performs NAT for instances with public IP addresses
  • Horizontally scaled, redundant, and highly available by AWS
  • No bandwidth constraints or throughput limits

Key Points:

  • The gateway itself is free - no hourly charges or data processing fees (standard data transfer out and public IPv4 address charges still apply)
  • Only works for resources with public IP addresses
  • Requires a route in your route table pointing 0.0.0.0/0 to the IGW

When to use:

  • Public-facing resources like Application Load Balancers
  • Bastion hosts that need direct internet access
  • Any resource that needs to be reachable from the internet

4. NAT Gateway

NAT Gateway solves a common problem: your private instances need to download updates, call external APIs, or access AWS services, but you don’t want to give them public IPs. NAT Gateway provides outbound internet connectivity while keeping instances completely private.

How it works:

  • Deployed in a public subnet (needs internet access itself)
  • Private instances route their outbound traffic through the NAT Gateway
  • NAT Gateway translates private IPs to its public IP
  • Return traffic is automatically routed back to the originating instance
  • Important: Only works for outbound traffic - inbound connections are blocked

Cost: $0.045/hour ($32/month) + $0.045/GB data processed

Performance: Scales automatically up to 100 Gbps bandwidth per NAT Gateway

Best Practices:

  • High availability: Deploy one NAT Gateway per AZ (prevents single point of failure)
  • Cost optimization: Use single NAT Gateway in dev/test environments
  • Avoid NAT charges: Use VPC Endpoints for AWS services (S3, DynamoDB, etc.)
  • Monitor costs: Data processing fees can add up—review VPC Flow Logs to identify high-traffic sources

Cost Optimization:

Choosing the right NAT Gateway configuration can significantly impact your monthly AWS bill. For development and test environments where high availability isn’t critical, a single NAT Gateway saves around $64/month. Production environments should use one NAT Gateway per Availability Zone to ensure that if one AZ experiences issues, workloads in other AZs can still reach the internet independently.

NAT Gateway Cost Optimization
terraform
# Development/Test: Single NAT Gateway
module "vpc_dev" {
  source = "terraform-aws-modules/vpc/aws"
  # ... name, cidr, azs, and subnets as in the Quick Start ...

  enable_nat_gateway = true
  single_nat_gateway = true # Saves $64/month (2 NAT Gateways)
}

# Production: One NAT Gateway per AZ
module "vpc_prod" {
  source = "terraform-aws-modules/vpc/aws"
  # ... name, cidr, azs, and subnets as in the Quick Start ...

  enable_nat_gateway     = true
  one_nat_gateway_per_az = true # High availability
}

5. Route Tables

Route tables are like GPS for your VPC traffic—they determine where network packets go. Every subnet must be associated with a route table, and that route table’s rules determine how traffic flows.

How they work:

  • Each route has a destination (CIDR block) and a target (where to send traffic)
  • Routes are evaluated from most specific to least specific
  • Local routes (within VPC) are automatically added and can’t be deleted
  • Each subnet can only be associated with one route table

Key Concepts:

  • Main route table: Created automatically with your VPC, used by default for any subnet without explicit association
  • Custom route tables: Create these for specific routing needs (public vs private subnets)
  • Best practice: Don’t use the main route table for production subnets—create explicit route tables

Example Route Table (Private Subnet):

Destination    Target
10.0.0.0/16    local (VPC)
0.0.0.0/0      nat-xxxxx (NAT Gateway)

Example Route Table (Public Subnet):

Destination    Target
10.0.0.0/16    local (VPC)
0.0.0.0/0      igw-xxxxx (Internet Gateway)
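
The two tables above map onto a handful of Terraform resources. The VPC module creates the equivalent for you, so this is only a sketch of what sits underneath it. The four variables are hypothetical inputs standing in for an existing VPC, subnets, and NAT Gateway.

```terraform
# Hypothetical inputs: IDs of resources that already exist.
variable "vpc_id" { type = string }
variable "public_subnet_id" { type = string }
variable "private_subnet_id" { type = string }
variable "nat_gateway_id" { type = string }

resource "aws_internet_gateway" "main" {
  vpc_id = var.vpc_id
}

# Public route table: default route points at the Internet Gateway.
# The local route for the VPC CIDR is added by AWS automatically.
resource "aws_route_table" "public" {
  vpc_id = var.vpc_id
}

resource "aws_route" "public_default" {
  route_table_id         = aws_route_table.public.id
  destination_cidr_block = "0.0.0.0/0"
  gateway_id             = aws_internet_gateway.main.id
}

resource "aws_route_table_association" "public" {
  subnet_id      = var.public_subnet_id
  route_table_id = aws_route_table.public.id
}

# Private route table: default route points at the NAT Gateway.
resource "aws_route_table" "private" {
  vpc_id = var.vpc_id
}

resource "aws_route" "private_default" {
  route_table_id         = aws_route_table.private.id
  destination_cidr_block = "0.0.0.0/0"
  nat_gateway_id         = var.nat_gateway_id
}

resource "aws_route_table_association" "private" {
  subnet_id      = var.private_subnet_id
  route_table_id = aws_route_table.private.id
}
```

Explicit associations like these are what keep production subnets off the main route table.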

6. Elastic IPs

Elastic IPs are static, public IPv4 addresses that you can allocate to your AWS account and associate with resources. They’re essential when you need a predictable public IP address.

Common use cases:

  • NAT Gateways: Each NAT Gateway requires an Elastic IP
  • Whitelisting: Third-party APIs that require IP whitelisting
  • DNS records: When you need a static IP for A records
  • Failover: Quickly remap to a standby instance during failures

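Cost: Since February 1, 2024, AWS charges $0.005/hour (~$3.60/month) for every public IPv4 address, including Elastic IPs, whether or not it is associated with a running resource. An idle Elastic IP costs the same as one in use and delivers nothing, so release any you don't need.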

The following Terraform configuration demonstrates how to allocate Elastic IPs for NAT Gateways across multiple Availability Zones. This pattern is common in production environments where you need predictable outbound IP addresses for whitelisting with third-party services or compliance requirements.

Elastic IP Allocation for NAT Gateways
terraform
# Use this pattern when you manage NAT Gateways yourself. To keep the VPC
# module's built-in NAT Gateways instead, pass the EIPs to the module with
# reuse_nat_ips = true and external_nat_ip_ids = aws_eip.nat[*].id.

# Allocate Elastic IPs for NAT Gateways
resource "aws_eip" "nat" {
  count  = 3 # One per AZ
  domain = "vpc"

  tags = {
    Name = "nat-eip-${count.index + 1}"
  }
}

# Associate with NAT Gateway
resource "aws_nat_gateway" "main" {
  count         = 3
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = module.vpc.public_subnets[count.index]

  tags = {
    Name = "nat-gw-${count.index + 1}"
  }
}

# Output for whitelisting with third-party services
output "nat_public_ips" {
  description = "Public IPs for outbound traffic (whitelist these)"
  value       = aws_eip.nat[*].public_ip
}

Elastic IP Best Practices

  • Release unused EIPs: Audit regularly to avoid charges for unassociated IPs
  • Document allocations: Track which EIPs are whitelisted with external services
  • Use tags: Tag EIPs with their purpose for easy identification
  • Limit of 5 per region: Default quota, can request increase if needed

AWS Security: Defense in Depth

AWS networking security follows a “defense in depth” strategy—multiple layers of security controls that work together. If one layer is breached, others provide backup protection. Think of it like a castle with multiple walls, moats, and gates.

The security layers covered below are Security Groups, Network ACLs, and VPC Flow Logs.

Security Groups (Most Important!)

Security Groups are your most important security control in AWS. They act as virtual firewalls for your EC2 instances, controlling both inbound and outbound traffic. Understanding security groups is essential for any AWS deployment.

What makes them special:

  • Stateful: If you allow inbound traffic on port 443, the response traffic is automatically allowed back out—you don’t need a separate outbound rule
  • Deny by default: All inbound traffic is blocked unless you explicitly allow it. All outbound traffic is allowed by default
  • Instance-level: Applied to ENIs (Elastic Network Interfaces), not subnets
  • Dynamic: Changes take effect immediately—no need to restart instances
  • Security group references: You can allow traffic from another security group instead of IP ranges (powerful for microservices)

Limits to know:

  • 60 inbound + 60 outbound rules per security group (default, can increase)
  • 5 security groups per ENI (default, can increase to 16)
  • 2,500 security groups per Region (default)

Common mistake: New users often confuse security groups with NACLs. Security groups are stateful and deny by default; NACLs are stateless and require explicit allow/deny rules.

Production Example:

This comprehensive example demonstrates a three-tier security group architecture commonly used in production environments. The web tier accepts traffic from the internet, the application tier only accepts traffic from the web tier, and the database tier only accepts traffic from the application tier. This creates a defense-in-depth approach where each layer can only communicate with its adjacent layers, significantly reducing the attack surface.

Three-Tier Security Group Architecture
terraform
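# Web tier security group
resource "aws_security_group" "web_tier" {
  name        = "web-tier-sg"
  description = "Security group for web tier"
  vpc_id      = module.vpc.vpc_id

  tags = {
    Name = "web-tier-sg"
    Tier = "web"
  }
}

# Application tier security group
resource "aws_security_group" "app_tier" {
  name        = "app-tier-sg"
  description = "Security group for application tier"
  vpc_id      = module.vpc.vpc_id

  tags = {
    Name = "app-tier-sg"
    Tier = "application"
  }
}

# Database tier security group
resource "aws_security_group" "db_tier" {
  name        = "db-tier-sg"
  description = "Security group for database tier"
  vpc_id      = module.vpc.vpc_id

  # No outbound internet access
  # (Add specific egress rules if needed)
  tags = {
    Name = "db-tier-sg"
    Tier = "database"
  }
}

# Rules are separate resources: inline ingress/egress blocks in which the
# groups reference each other would give Terraform a dependency cycle.

# Web tier: allow HTTPS from anywhere
resource "aws_vpc_security_group_ingress_rule" "web_https" {
  security_group_id = aws_security_group.web_tier.id
  description       = "HTTPS from internet"
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}

# Web tier: allow HTTP (redirect to HTTPS at ALB)
resource "aws_vpc_security_group_ingress_rule" "web_http" {
  security_group_id = aws_security_group.web_tier.id
  description       = "HTTP from internet"
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 80
  to_port           = 80
  ip_protocol       = "tcp"
}

# Web tier: outbound to app tier only
resource "aws_vpc_security_group_egress_rule" "web_to_app" {
  security_group_id            = aws_security_group.web_tier.id
  description                  = "To app tier"
  referenced_security_group_id = aws_security_group.app_tier.id
  from_port                    = 8080
  to_port                      = 8080
  ip_protocol                  = "tcp"
}

# App tier: only allow traffic from web tier
resource "aws_vpc_security_group_ingress_rule" "app_from_web" {
  security_group_id            = aws_security_group.app_tier.id
  description                  = "From web tier"
  referenced_security_group_id = aws_security_group.web_tier.id
  from_port                    = 8080
  to_port                      = 8080
  ip_protocol                  = "tcp"
}

# App tier: outbound to database tier only
resource "aws_vpc_security_group_egress_rule" "app_to_db" {
  security_group_id            = aws_security_group.app_tier.id
  description                  = "To database tier"
  referenced_security_group_id = aws_security_group.db_tier.id
  from_port                    = 5432
  to_port                      = 5432
  ip_protocol                  = "tcp"
}

# App tier: outbound HTTPS for API calls
resource "aws_vpc_security_group_egress_rule" "app_https_out" {
  security_group_id = aws_security_group.app_tier.id
  description       = "HTTPS for external APIs"
  cidr_ipv4         = "0.0.0.0/0"
  from_port         = 443
  to_port           = 443
  ip_protocol       = "tcp"
}

# Database tier: only allow traffic from app tier
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db_tier.id
  description                  = "PostgreSQL from app tier"
  referenced_security_group_id = aws_security_group.app_tier.id
  from_port                    = 5432
  to_port                      = 5432
  ip_protocol                  = "tcp"
}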

Network ACLs (NACLs)

Network ACLs are often misunderstood and overused. They’re stateless firewalls that operate at the subnet level, providing an additional layer of security beyond Security Groups. However, for most use cases, Security Groups alone are sufficient.

Key difference from Security Groups:

  • Stateless: Unlike Security Groups, NACLs don’t track connection state. If you allow inbound traffic on port 443, you must also explicitly allow the return traffic on ephemeral ports (1024-65535)
  • Subnet-level: Applied to entire subnets, not individual instances
  • Rule evaluation: Rules are numbered and evaluated in order (lowest number first)
  • Allow and Deny: Can explicitly deny traffic (Security Groups can only allow)

When to use NACLs:

  • Block specific IPs: Deny traffic from known malicious IP ranges
  • Compliance requirements: Some regulations require subnet-level controls
  • Additional defense layer: Defense in depth strategy
  • Temporary blocks: Quick way to block traffic to an entire subnet

Best Practice: Start with the default NACL (allows all traffic) and only add custom NACLs when you have a specific security requirement. Most applications don’t need them.
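
For the cases where a custom NACL is justified, a minimal sketch might look like the following. It shows the two properties described above: rules are evaluated lowest number first, and return traffic needs its own rule because NACLs are stateless. The variables and the blocked range (198.51.100.0/24, a documentation-only range) are placeholders, not values from this guide.

```terraform
# Hypothetical inputs: an existing VPC and the subnets the NACL should cover.
variable "vpc_id" { type = string }
variable "subnet_ids" { type = list(string) }

resource "aws_network_acl" "public" {
  vpc_id     = var.vpc_id
  subnet_ids = var.subnet_ids

  tags = {
    Name = "public-nacl"
  }
}

# Rule 90 is evaluated before rule 100, so the deny wins for this range.
resource "aws_network_acl_rule" "deny_blocked_range" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 90
  egress         = false
  protocol       = "-1" # all protocols
  rule_action    = "deny"
  cidr_block     = "198.51.100.0/24" # placeholder for a known-bad range
}

# Inbound HTTPS from everyone else.
resource "aws_network_acl_rule" "allow_https_in" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 100
  egress         = false
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 443
  to_port        = 443
}

# Stateless: responses to clients leave on ephemeral ports and need their own rule.
resource "aws_network_acl_rule" "allow_ephemeral_out" {
  network_acl_id = aws_network_acl.public.id
  rule_number    = 100
  egress         = true
  protocol       = "tcp"
  rule_action    = "allow"
  cidr_block     = "0.0.0.0/0"
  from_port      = 1024
  to_port        = 65535
}
```

A custom NACL denies everything that no rule allows, so this sketch permits only inbound HTTPS. A real subnet would need further rules, for example for outbound connections that its instances initiate.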

VPC Flow Logs

VPC Flow Logs are your network traffic recorder—they capture information about IP traffic going to and from network interfaces in your VPC. Think of them as your VPC’s black box, essential for troubleshooting, security analysis, and cost optimization.

What they capture:

  • Source and destination IP addresses
  • Source and destination ports
  • Protocol (TCP, UDP, ICMP)
  • Number of packets and bytes
  • Action taken (ACCEPT or REJECT)
  • Timestamps

Why you need them:

  • Troubleshooting: Diagnose why connections are failing (security group rules, routing issues)
  • Security analysis: Detect unusual traffic patterns, potential attacks, or data exfiltration
  • Cost optimization: Identify which resources are generating high data transfer costs
  • Compliance: Many regulations require network traffic logging
  • Forensics: Investigate security incidents after they occur

Cost consideration: Flow Logs sent to CloudWatch Logs are charged by the amount of data ingested (~$0.50 per GB). For high-traffic VPCs, this can add up. Consider capturing only REJECT traffic, or enabling Flow Logs on specific subnets rather than the whole VPC.

Implementation:

VPC Flow Logs can be sent to either CloudWatch Logs for real-time analysis and alerting, or to S3 for cost-effective long-term storage and batch analysis with Athena. The CloudWatch option is ideal for operational monitoring and quick troubleshooting, while S3 with Parquet format is better for compliance requirements and historical analysis. Most production environments use both destinations for different retention periods.

VPC Flow Logs Configuration
terraform
# VPC Flow Logs to CloudWatch
resource "aws_flow_log" "vpc_flow_logs" {
  log_destination      = aws_cloudwatch_log_group.flow_logs.arn
  log_destination_type = "cloud-watch-logs"
  traffic_type         = "ALL" # or "ACCEPT" or "REJECT"
  vpc_id               = module.vpc.vpc_id
  iam_role_arn         = aws_iam_role.flow_logs.arn

  tags = {
    Name = "vpc-flow-logs"
  }
}

resource "aws_cloudwatch_log_group" "flow_logs" {
  name              = "/aws/vpc-flow-log/main-vpc"
  retention_in_days = 90 # Adjust based on compliance needs

  tags = {
    Name = "vpc-flow-logs"
  }
}

# VPC Flow Logs to S3 (cheaper for long-term storage)
resource "aws_flow_log" "vpc_flow_logs_s3" {
  log_destination      = aws_s3_bucket.flow_logs.arn
  log_destination_type = "s3"
  traffic_type         = "ALL"
  vpc_id               = module.vpc.vpc_id

  # Parquet format for Athena queries
  destination_options {
    file_format        = "parquet"
    per_hour_partition = true
  }

  tags = {
    Name = "vpc-flow-logs-s3"
  }
}
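
The configuration above references aws_iam_role.flow_logs and aws_s3_bucket.flow_logs without defining them. A minimal sketch of both follows. The role name, policy name, and bucket name are hypothetical, and the wildcard Resource should be narrowed to the log group in a real deployment.

```terraform
# Role that the VPC Flow Logs service assumes to write to CloudWatch Logs.
resource "aws_iam_role" "flow_logs" {
  name = "vpc-flow-logs-role" # hypothetical name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "vpc-flow-logs.amazonaws.com"
      }
    }]
  })
}

resource "aws_iam_role_policy" "flow_logs" {
  name = "vpc-flow-logs-policy" # hypothetical name
  role = aws_iam_role.flow_logs.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ]
      Resource = "*" # assumption: tighten to the flow log group ARN in production
    }]
  })
}

# Destination bucket for the S3 flow log (no IAM role is needed for S3 delivery).
resource "aws_s3_bucket" "flow_logs" {
  bucket = "example-vpc-flow-logs" # hypothetical; bucket names must be globally unique
}
```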

VPC Endpoints: Save Money on NAT Gateway

Here’s a common problem: Your EC2 instances in private subnets need to access AWS services like S3 or DynamoDB. Without VPC Endpoints, this traffic goes through your NAT Gateway, costing you $0.045/GB in data processing fees. For high-traffic applications, this can mean hundreds of dollars per month in unnecessary costs.

The solution: VPC Endpoints provide private connectivity to AWS services without going through the NAT Gateway or internet. Traffic stays on AWS’s private network, improving security and reducing costs.

Two types of VPC Endpoints:

  1. Gateway Endpoints (FREE): For S3 and DynamoDB only
  2. Interface Endpoints (~$7/month per AZ + data): For most other AWS services

Cost savings example: If your application transfers 1TB/month to S3 through NAT Gateway:

  • Without VPC Endpoint: 1000 GB × $0.045 = $45/month
  • With Gateway Endpoint: $0/month

That’s $540/year saved per TB of S3 traffic!

Gateway Endpoints (FREE!)

Gateway Endpoints for S3 and DynamoDB are completely free and should be deployed in every VPC. They work by adding routes to your route tables that direct traffic destined for these services through AWS’s private network instead of the NAT Gateway. This not only saves money but also improves security by keeping traffic off the public internet and reduces latency for high-throughput workloads.

S3 and DynamoDB Gateway Endpoints
terraform
# S3 Gateway Endpoint - FREE, no hourly charges
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.region}.s3"
  vpc_endpoint_type = "Gateway"

  # Associate with private subnet route tables
  route_table_ids = module.vpc.private_route_table_ids

  tags = {
    Name = "s3-gateway-endpoint"
  }
}

# DynamoDB Gateway Endpoint - FREE
resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.region}.dynamodb"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = module.vpc.private_route_table_ids

  tags = {
    Name = "dynamodb-gateway-endpoint"
  }
}

Savings: $0.045/GB on traffic that would otherwise go through the NAT Gateway

Interface Endpoints (Paid)

Interface Endpoints use AWS PrivateLink to create elastic network interfaces (ENIs) in your subnets that serve as entry points for traffic destined to supported AWS services. Unlike Gateway Endpoints, they support a wide range of services including ECR, CloudWatch, Secrets Manager, and many more. The trade-off is cost: each endpoint runs about $7/month for every Availability Zone it is deployed in, plus data processing fees. For services with high traffic volumes, the savings over NAT Gateway data processing fees can still be substantial.

Cost: $0.01/hour per AZ ($7/month) + $0.01/GB data processed

ECR Interface Endpoints
terraform
# ECR API Endpoint
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "ecr-api-endpoint"
  }
}

# ECR Docker Endpoint
resource "aws_vpc_endpoint" "ecr_dkr" {
  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.${var.region}.ecr.dkr"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "ecr-dkr-endpoint"
  }
}

# Security group for VPC endpoints
resource "aws_security_group" "vpc_endpoints" {
  name        = "vpc-endpoints-sg"
  description = "Security group for VPC endpoints"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "HTTPS from VPC"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }

  tags = {
    Name = "vpc-endpoints-sg"
  }
}

When to use Interface Endpoints:

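  • High traffic to AWS services (more than ~210GB/month per endpoint per AZ)
  • Break-even point: the ~$7.30/month per-AZ fee is recovered once the $0.035/GB saved over NAT Gateway data processing ($0.045 - $0.01) covers it, at about 210GB/month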
  • Services: ECR, CloudWatch Logs, Systems Manager, Secrets Manager

Transit Gateway: Hub-and-Spoke Architecture

As your AWS environment grows, connecting multiple VPCs becomes complex. VPC Peering works for 2-3 VPCs, but a full mesh of N VPCs needs N×(N-1)/2 peering connections: 10 connections for 5 VPCs and 45 for 10. Transit Gateway solves this by acting as a central hub. Each VPC connects once to the Transit Gateway, and the Transit Gateway handles routing between all VPCs.

Why use Transit Gateway:

  • Simplified connectivity: Connect 10 VPCs with 10 attachments instead of 45 peering connections
  • Centralized routing: Manage all inter-VPC routing in one place
  • On-premises integration: Connect your data center once, reach all VPCs
  • Scalability: Supports thousands of VPCs per Transit Gateway
  • Segmentation: Use route tables to control which VPCs can talk to each other

How it works:

  1. Create a Transit Gateway in your region
  2. Attach VPCs to the Transit Gateway (one attachment per VPC)
  3. Configure route tables on the Transit Gateway to control traffic flow
  4. Update VPC route tables to send inter-VPC traffic to the Transit Gateway

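Cost: $0.05/hour per attachment (about $36/month for each attached VPC or VPN connection) + $0.02/GB data processed

Break-even point: VPC Peering has no hourly charge, so for two or three VPCs it is usually cheaper. Transit Gateway starts to earn its per-attachment fee at around 3+ VPCs that need to communicate, when the effort of managing a growing peering mesh outweighs the extra spend.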

Implementation:

Setting up Transit Gateway involves creating the gateway itself, attaching your VPCs, and updating route tables in each VPC to direct inter-VPC traffic through the Transit Gateway. The configuration below shows a typical setup with production and development VPCs. Note that you attach subnets (not the VPC directly) to the Transit Gateway—typically your private subnets since that’s where most inter-VPC traffic originates.

Transit Gateway Hub-and-Spoke Setup
terraform
# Transit Gateway
resource "aws_ec2_transit_gateway" "main" {
  description                     = "Main Transit Gateway"
  default_route_table_association = "enable"
  default_route_table_propagation = "enable"
  dns_support                     = "enable"
  vpn_ecmp_support                = "enable"

  tags = {
    Name = "main-tgw"
  }
}

# Attach Production VPC
resource "aws_ec2_transit_gateway_vpc_attachment" "production" {
  subnet_ids         = module.production_vpc.private_subnets
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = module.production_vpc.vpc_id
  dns_support        = "enable"

  tags = {
    Name = "tgw-attachment-production"
  }
}

# Attach Development VPC
resource "aws_ec2_transit_gateway_vpc_attachment" "development" {
  subnet_ids         = module.development_vpc.private_subnets
  transit_gateway_id = aws_ec2_transit_gateway.main.id
  vpc_id             = module.development_vpc.vpc_id
  dns_support        = "enable"

  tags = {
    Name = "tgw-attachment-development"
  }
}

# Route from Production VPC to Transit Gateway
resource "aws_route" "production_to_tgw" {
  count                  = length(module.production_vpc.private_route_table_ids)
  route_table_id         = module.production_vpc.private_route_table_ids[count.index]
  destination_cidr_block = "10.0.0.0/8" # All internal traffic
  transit_gateway_id     = aws_ec2_transit_gateway.main.id

  # The route can only be created once the VPC is attached
  depends_on = [aws_ec2_transit_gateway_vpc_attachment.production]
}
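
The setup above lets every attached VPC reach every other one through the default route table. The segmentation mentioned earlier uses dedicated Transit Gateway route tables instead. The following is a sketch under stated assumptions: the attachments were created with default route table association and propagation turned off, and a shared-services VPC exists. That VPC and the three variables are hypothetical and not part of this guide's setup.

```terraform
# Hypothetical inputs: an existing Transit Gateway and two VPC attachments.
variable "transit_gateway_id" { type = string }
variable "prod_attachment_id" { type = string }
variable "shared_attachment_id" { type = string }

# Dedicated route table for production traffic.
resource "aws_ec2_transit_gateway_route_table" "prod" {
  transit_gateway_id = var.transit_gateway_id

  tags = {
    Name = "tgw-rt-production"
  }
}

# Traffic arriving from the production attachment is routed using this table.
resource "aws_ec2_transit_gateway_route_table_association" "prod" {
  transit_gateway_attachment_id  = var.prod_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}

# Only the shared-services VPC propagates its routes into the table, so
# production has no route to any other VPC through the Transit Gateway.
resource "aws_ec2_transit_gateway_route_table_propagation" "shared_to_prod" {
  transit_gateway_attachment_id  = var.shared_attachment_id
  transit_gateway_route_table_id = aws_ec2_transit_gateway_route_table.prod.id
}
```

Because the development attachment never propagates into this table, production traffic destined for development is dropped at the Transit Gateway even though both VPCs are attached to it.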

When to Use Transit Gateway

  • Multiple VPCs - 3+ VPCs that need to communicate with each other
  • Hybrid cloud - On-premises connectivity via VPN or Direct Connect
  • Centralized control - Need centralized routing and security inspection
  • Cost consideration - Per-attachment charges are usually justified over VPC Peering at ~3+ VPCs

Direct Connect: Dedicated Network Connection

Direct Connect is AWS’s solution for dedicated, private connectivity between your data center and AWS. Unlike VPN connections that go over the public internet, Direct Connect uses a dedicated fiber connection, providing predictable performance and lower latency.

Why use Direct Connect instead of VPN:

  • Higher bandwidth: 1 Gbps, 10 Gbps, or 100 Gbps vs VPN’s 1.25 Gbps per tunnel
  • Consistent performance: Dedicated bandwidth with predictable latency
  • Lower data transfer costs: $0.02/GB vs $0.09/GB for internet egress
  • Better security: Traffic never touches the public internet
  • Compliance: Some regulations require private connectivity

How it works:

  1. Order a Direct Connect port at an AWS Direct Connect location (colocation facility)
  2. Work with a network provider to establish physical connectivity
  3. Create a Virtual Interface (VIF) to connect to your VPC or Transit Gateway
  4. Configure BGP routing between your router and AWS
  5. Traffic flows over the dedicated connection

Setup time: Typically 2-4 weeks (requires physical circuit provisioning)

Common use cases:

  • Data migration: Transfer large datasets to AWS (faster than internet upload)
  • Hybrid applications: Low-latency connectivity for hybrid cloud workloads
  • Disaster recovery: Reliable connection for replication and backup
  • Production workloads: Predictable performance for mission-critical applications

Cost:

  • Port hours: $0.30/hour for 1Gbps = $216/month
  • Data transfer out: $0.02/GB (cheaper than internet)

Comparison: Direct Connect vs Site-to-Site VPN

Bandwidth:

  • Direct Connect: 1 Gbps, 10 Gbps, or 100 Gbps dedicated
  • Site-to-Site VPN: Up to 1.25 Gbps per tunnel (can use multiple tunnels)

Latency:

  • Direct Connect: Low and consistent (private connection)
  • Site-to-Site VPN: Higher and variable (depends on internet)

Cost:

  • Direct Connect: $216/month (1 Gbps port) + $0.02/GB egress
  • Site-to-Site VPN: $36/month (VPN connection) + $0.09/GB egress

Setup Time:

  • Direct Connect: 2-4 weeks (physical circuit provisioning)
  • Site-to-Site VPN: Minutes (fully self-service)

Reliability:

  • Direct Connect: 99.9% SLA (recommend VPN backup)
  • Site-to-Site VPN: 99.95% SLA (two tunnels for HA)

Best for:

  • Direct Connect: Production workloads, large data transfers, consistent performance needs
  • Site-to-Site VPN: Dev/test environments, backup connectivity, quick setup

Pro tip: Use both! Direct Connect for primary connectivity with VPN as backup.

Implementation:

Direct Connect requires a Direct Connect Gateway to connect your on-premises network to multiple VPCs through a Transit Gateway. The configuration below shows how to set up the gateway association and a backup VPN connection. In production, you should always have a VPN backup because Direct Connect is a single physical connection that can fail. The VPN automatically takes over if the Direct Connect link goes down.

Direct Connect with VPN Backup
terraform
# Direct Connect Gateway
resource "aws_dx_gateway" "main" {
  name = "main-dx-gateway"
  # Must differ from the Transit Gateway's ASN, which defaults to 64512
  amazon_side_asn = 64513
}

# Associate with Transit Gateway
resource "aws_dx_gateway_association" "main" {
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_ec2_transit_gateway.main.id

  allowed_prefixes = [
    "10.0.0.0/8",
    "172.16.0.0/12"
  ]
}

# Site-to-Site VPN as backup
resource "aws_vpn_connection" "backup" {
  customer_gateway_id = aws_customer_gateway.main.id
  transit_gateway_id  = aws_ec2_transit_gateway.main.id
  type                = "ipsec.1"
  static_routes_only  = false

  tags = {
    Name = "backup-vpn"
  }
}
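
The backup VPN references aws_customer_gateway.main, which represents your on-premises router. A minimal sketch of that resource follows. The ASN and IP address are placeholders (203.0.113.10 comes from a documentation-only range) and must be replaced with your router's real values.

```terraform
# Customer gateway: AWS's record of the on-premises VPN device.
resource "aws_customer_gateway" "main" {
  bgp_asn    = 65000          # placeholder: private ASN of the on-premises router
  ip_address = "203.0.113.10" # placeholder: public IP of the on-premises router
  type       = "ipsec.1"

  tags = {
    Name = "on-prem-cgw"
  }
}
```

Since static_routes_only is false on the VPN connection, routes are exchanged over BGP, which is what allows traffic to shift to the VPN when the Direct Connect routes are withdrawn.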

AWS Networking Cost Optimization

Monthly Cost Breakdown Example

Production VPC (us-east-1):

NAT Gateways (3 AZs):
  - Hourly: 3 × $0.045 × 730 hours = $98.55
  - Data processing: 500GB × $0.045 = $22.50
  - Subtotal: $121.05

VPC Endpoints (Interface, one AZ each):
  - ECR API: $7.30
  - ECR DKR: $7.30
  - CloudWatch Logs: $7.30
  - Subtotal: $21.90

Load Balancers:
  - Application Load Balancer: $16.20 + $8/LCU
  - Subtotal: ~$25

Data Transfer:
  - Inter-AZ: 200GB × $0.01 = $2.00
  - Internet egress: 1TB × $0.09 = $90.00
  - Subtotal: $92.00

Total Monthly Cost: ~$260

Cost Optimization Strategies

Add Gateway Endpoints (FREE)

Deploy S3 and DynamoDB Gateway Endpoints at no cost to eliminate NAT Gateway charges for AWS service traffic. Can save $0.045/GB in data processing fees.

Single NAT Gateway for Dev/Test

Use one NAT Gateway instead of three in non-production environments to save ~$64/month. Accept the reduced availability for cost savings.

Release Unused Resources

Delete unused Elastic IPs ($3.60/month each) and load balancers ($16-25/month each). Regular audits prevent waste.

VPC Peering Over Transit Gateway

For simple connectivity between 2-3 VPCs, use VPC Peering (no hourly charge) instead of Transit Gateway (about $36/month per attachment + data charges).

Interface Endpoints for High Traffic

Add Interface Endpoints (~$7/month per AZ) for services that send more than roughly 210GB/month per AZ through the NAT Gateway. Past that break-even point they are cost-effective for ECR, CloudWatch, etc.

Monitor Data Transfer Patterns

Use VPC Flow Logs and Cost Explorer to identify and optimize expensive cross-AZ and internet data transfer patterns.

DNS and Service Discovery

Proper DNS configuration is essential for service-to-service communication. AWS provides several options depending on your architecture.

Route 53 Private Hosted Zones

Private hosted zones allow you to use custom DNS names within your VPC without exposing them to the public internet. This is useful for giving friendly names to internal services, databases, and other resources. You can associate a private hosted zone with multiple VPCs, allowing resources in different VPCs to resolve the same DNS names—useful for shared services architectures.

Route 53 Private Hosted Zone
terraform
# Private hosted zone
resource "aws_route53_zone" "private" {
  name = "internal.mycompany.com"

  vpc {
    vpc_id = module.vpc.vpc_id
  }

  # Additional VPCs are attached with aws_route53_zone_association;
  # ignoring changes to vpc stops this resource from reverting them
  lifecycle {
    ignore_changes = [vpc]
  }

  tags = {
    Name = "internal-dns"
  }
}

# DNS record for internal service
resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "api.internal.mycompany.com"
  type    = "A"

  alias {
    name                   = aws_lb.internal.dns_name
    zone_id                = aws_lb.internal.zone_id
    evaluate_target_health = true
  }
}

# DNS record for database
resource "aws_route53_record" "db" {
  zone_id = aws_route53_zone.private.zone_id
  name    = "db.internal.mycompany.com"
  type    = "CNAME"
  ttl     = 300
  records = [aws_db_instance.main.address]
}

Cross-VPC DNS Resolution

When using Transit Gateway or VPC Peering, you may need DNS resolution across VPCs so that resources in one VPC can resolve private DNS names from another VPC. This requires enabling DNS resolution on the peering connection and associating the private hosted zone with the peer VPC. Without this configuration, DNS queries for private hosted zone records will fail from the peered VPC.

Cross-VPC DNS Resolution
terraform
# Enable DNS resolution for VPC peering
resource "aws_vpc_peering_connection_options" "requester" {
  vpc_peering_connection_id = aws_vpc_peering_connection.main.id

  requester {
    allow_remote_vpc_dns_resolution = true
  }
}

resource "aws_vpc_peering_connection_options" "accepter" {
  vpc_peering_connection_id = aws_vpc_peering_connection.main.id

  accepter {
    allow_remote_vpc_dns_resolution = true
  }
}

# Associate private hosted zone with peered VPC
resource "aws_route53_zone_association" "peer" {
  zone_id = aws_route53_zone.private.zone_id
  vpc_id  = module.peer_vpc.vpc_id
}

Common AWS Networking Issues

Issue 1: Can’t SSH to EC2 in Private Subnet

Solution: Use AWS Systems Manager Session Manager (no SSH port needed!)

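Session Manager provides secure shell access to your EC2 instances without opening inbound ports, managing SSH keys, or using bastion hosts. It works through the SSM Agent, which is preinstalled on Amazon Linux AMIs and can be installed on other operating systems. The instance needs an instance profile with SSM permissions and an outbound HTTPS path to the Systems Manager endpoints, either through a NAT Gateway or through interface endpoints for ssm, ssmmessages, and ec2messages. Session activity can be logged to CloudWatch Logs and S3 for auditing once logging is enabled in the Session Manager preferences.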

SSM Session Manager IAM Configuration
terraform
# IAM role for EC2 instances
resource "aws_iam_role" "ec2_ssm" {
  name = "ec2-ssm-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action = "sts:AssumeRole"
      Effect = "Allow"
      Principal = {
        Service = "ec2.amazonaws.com"
      }
    }]
  })
}

# Attach SSM policy
resource "aws_iam_role_policy_attachment" "ec2_ssm" {
  role       = aws_iam_role.ec2_ssm.name
  policy_arn = "arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore"
}

# Instance profile
resource "aws_iam_instance_profile" "ec2_ssm" {
  name = "ec2-ssm-profile"
  role = aws_iam_role.ec2_ssm.name
}
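
The instance profile only takes effect once it is attached to an instance. A minimal sketch of that step follows, assuming the `module.vpc` from the Quick Start; the `ami_id` variable, the instance name, and the instance type are hypothetical placeholders.

```terraform
# Hypothetical input: any AMI that ships with the SSM Agent, such as Amazon Linux.
variable "ami_id" {
  description = "AMI ID with the SSM Agent preinstalled"
  type        = string
}

# Instance in a private subnet, reachable through Session Manager only.
resource "aws_instance" "app" {
  ami                  = var.ami_id
  instance_type        = "t3.micro"
  subnet_id            = module.vpc.private_subnets[0]
  iam_instance_profile = aws_iam_instance_profile.ec2_ssm.name

  tags = {
    Name = "ssm-managed-instance"
  }
}
```

No SSH key pair or inbound security group rule is defined here, which is the point of the approach: access is controlled by IAM rather than by network reachability.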

Connect via CLI (requires the Session Manager plugin for the AWS CLI):

bash
aws ssm start-session --target i-1234567890abcdef0

Issue 2: High NAT Gateway Costs

Diagnosis:

High NAT Gateway costs usually come from unexpected traffic patterns—often AWS service calls that could go through VPC endpoints instead. Use VPC Flow Logs to identify the top traffic sources and destinations, then add appropriate VPC endpoints for AWS services or optimize application behavior for external API calls.

AWS CLI: Confirm Flow Logs Are Enabled
bash
# List the flow log configurations attached to the VPC
aws ec2 describe-flow-logs --filter "Name=resource-id,Values=vpc-xxxxx"

SQL: Diagnosing NAT Gateway Traffic with Athena
sql
-- Rank source/destination pairs by bytes transferred
SELECT
  srcaddr,
  dstaddr,
  SUM(bytes) AS total_bytes
FROM vpc_flow_logs
WHERE action = 'ACCEPT'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20;

Solution: Add VPC Endpoints for AWS services
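
A minimal Terraform sketch of that fix for interface endpoints, assuming the `module.vpc` from the Quick Start in us-east-1 (where gateway endpoints for S3 are already enabled). The security group name and the choice of ECR and CloudWatch Logs as services are illustrative assumptions; pick the services your own flow log analysis points to.

```terraform
# Security group for the endpoint network interfaces (hypothetical name).
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpc-endpoints-"
  description = "Allow HTTPS from inside the VPC to interface endpoints"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "HTTPS from the VPC CIDR"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [module.vpc.vpc_cidr_block]
  }
}

# One interface endpoint per service, with an ENI in each private subnet.
resource "aws_vpc_endpoint" "interface" {
  for_each = toset(["ecr.api", "ecr.dkr", "logs"])

  vpc_id              = module.vpc.vpc_id
  service_name        = "com.amazonaws.us-east-1.${each.value}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.vpc.private_subnets
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true
}
```

Each endpoint is billed per AZ, so three services across three subnets means nine hourly charges. Reducing `subnet_ids` to a single subnet lowers the fixed cost at the price of cross-AZ traffic and reduced availability.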

Issue 3: VPC Peering Not Working

Checklist:

VPC peering issues usually fall into one of four categories: the peering connection isn’t active, route tables aren’t configured correctly, security groups don’t allow traffic from the peer VPC CIDR, or the VPCs have overlapping CIDR blocks. Work through this checklist systematically to identify the root cause.

AWS CLI: VPC Peering Troubleshooting Commands
bash
# 1. Check peering connection status
aws ec2 describe-vpc-peering-connections \
  --filters "Name=status-code,Values=active"
# 2. Verify route tables in BOTH VPCs (replace 10.1.0.0/16 with the peer VPC CIDR)
aws ec2 describe-route-tables \
  --filters "Name=route.destination-cidr-block,Values=10.1.0.0/16"
# 3. Check security groups allow peer VPC CIDR
aws ec2 describe-security-groups --group-ids sg-xxxxx
# 4. Verify no overlapping CIDR blocks (lists primary CIDRs; also check any secondary CIDRs)
aws ec2 describe-vpcs \
  --query "Vpcs[].[VpcId,CidrBlock]" --output table
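
Missing routes are the most common of these causes, and they are easy to manage in Terraform. The sketch below adds routes in both directions; it assumes `module.peer_vpc` is built with the same VPC module as `module.vpc` and reuses the `aws_vpc_peering_connection.main` resource referenced earlier. The route resource names are illustrative.

```terraform
# Routes from this VPC's private route tables to the peer VPC.
resource "aws_route" "to_peer" {
  count = length(module.vpc.private_route_table_ids)

  route_table_id            = module.vpc.private_route_table_ids[count.index]
  destination_cidr_block    = module.peer_vpc.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.main.id
}

# Return routes from the peer VPC's private route tables back to this VPC.
resource "aws_route" "from_peer" {
  count = length(module.peer_vpc.private_route_table_ids)

  route_table_id            = module.peer_vpc.private_route_table_ids[count.index]
  destination_cidr_block    = module.vpc.vpc_cidr_block
  vpc_peering_connection_id = aws_vpc_peering_connection.main.id
}
```

Peering is not transitive and traffic is only delivered when both sides have a route, so a one-directional setup shows up as connections that time out rather than being refused.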

Troubleshooting Tools

VPC Reachability Analyzer

VPC Reachability Analyzer is a powerful diagnostic tool that tests connectivity between resources without sending actual network traffic. It analyzes your VPC configuration—route tables, security groups, NACLs, and more—to determine if a path exists between a source and destination. This is invaluable for troubleshooting connectivity issues before or after deployment.

AWS CLI: VPC Reachability Analyzer
bash
# Create analysis path
aws ec2 create-network-insights-path \
  --source i-source-instance \
  --destination i-dest-instance \
  --protocol tcp \
  --destination-port 443
# Start analysis (use the path ID returned by the previous command)
aws ec2 start-network-insights-analysis \
  --network-insights-path-id nip-xxxxx
# Get results (use the analysis ID returned by the previous command)
aws ec2 describe-network-insights-analyses \
  --network-insights-analysis-ids nia-xxxxx
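
The same check can be kept in Terraform so that it runs alongside the infrastructure it verifies. This is a sketch under assumptions: the two instance ID variables and the `app_https` names are hypothetical, and the analysis runs when the resource is created rather than on every apply.

```terraform
# Hypothetical inputs: the two instances whose connectivity is being tested.
variable "source_instance_id" {
  type = string
}

variable "destination_instance_id" {
  type = string
}

# Path definition: TCP 443 from source to destination.
resource "aws_ec2_network_insights_path" "app_https" {
  source           = var.source_instance_id
  destination      = var.destination_instance_id
  protocol         = "tcp"
  destination_port = 443
}

# Runs the analysis once the path exists.
resource "aws_ec2_network_insights_analysis" "app_https" {
  network_insights_path_id = aws_ec2_network_insights_path.app_https.id
}

# true when the configuration allows the traffic, false otherwise.
output "app_https_reachable" {
  value = aws_ec2_network_insights_analysis.app_https.path_found
}
```

Each analysis is billed individually, so this fits a handful of critical paths better than blanket coverage.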

VPC Flow Logs Analysis

VPC Flow Logs combined with Athena provide powerful network traffic analysis capabilities. You can identify top talkers to understand bandwidth usage, find rejected connections to diagnose security group issues, and track traffic patterns over time. The queries below are starting points—customize them based on your specific troubleshooting needs.

SQL: Athena Queries for VPC Flow Log Analysis
sql
-- Top talkers
SELECT
  srcaddr,
  dstaddr,
  SUM(bytes) AS total_bytes,
  SUM(packets) AS packet_count -- COUNT(*) would count log records, not packets
FROM vpc_flow_logs
WHERE date = '2024-06-23'
GROUP BY srcaddr, dstaddr
ORDER BY total_bytes DESC
LIMIT 20;

-- Rejected connections (security issues)
SELECT
  srcaddr,
  dstaddr,
  srcport,
  dstport,
  COUNT(*) AS reject_count
FROM vpc_flow_logs
WHERE action = 'REJECT'
GROUP BY srcaddr, dstaddr, srcport, dstport
ORDER BY reject_count DESC;

Next Steps

You now have the knowledge to build production-ready networks on AWS! For deploying specific AWS services, check out our companion guide.
