Cloud Network Troubleshooting: VPC Flow Logs, NSG Diagnostics, and Packet Mirroring
Flow log analysis, VPC Reachability Analyzer, Azure Network Watcher, GCP Connectivity Tests, and step-by-step debugging for instances that cannot communicate and intermittent packet loss.
Cloud Networking Fails Silently
On-premises, when two servers cannot talk to each other, you plug in a serial cable, run tcpdump, and see exactly what is happening on the wire. In the cloud, there is no wire. Traffic traverses virtual network interfaces, security groups, NACLs, route tables, NAT gateways, load balancers, VPC peering connections, transit gateways, and private endpoints -- any one of which can silently drop or block traffic without returning an error to the sender.
Cloud network troubleshooting is fundamentally different from traditional network troubleshooting because the network is software-defined. You cannot run Wireshark on a virtual switch. You cannot traceroute through a NAT gateway. The tools you have are flow logs, reachability analyzers, and connectivity tests -- all of which have limitations and blind spots.
This guide covers the diagnostic tools available on each cloud, how to use them effectively, and step-by-step debugging for the most common cloud networking issues.
VPC Flow Logs: Your Primary Diagnostic Tool
Flow logs capture metadata about network traffic -- source IP, destination IP, source port, destination port, protocol, action (ACCEPT or REJECT), and byte/packet counts. They do not capture packet contents. Think of them as a call log for your network: you can see who called whom, when, and whether the call was answered, but you cannot hear the conversation.
AWS VPC Flow Logs
AWS flow logs can be attached at three levels: VPC (captures all traffic), subnet, or individual network interface. They publish to CloudWatch Logs, S3, or Kinesis Data Firehose. The default format includes 14 fields; the custom format supports up to 29 fields including traffic path, sublocation type, and VPC endpoint IDs.
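The default 14-field record is just whitespace-separated text, so triage scripts can parse it directly. A minimal sketch (field order per the version-2 default format; the sample record values are illustrative):

```python
# Parse a default-format (version 2) AWS VPC flow log record.
# The 14 whitespace-separated fields, in order, are:
# version account-id interface-id srcaddr dstaddr srcport dstport
# protocol packets bytes start end action log-status

FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]

def parse_flow_record(line: str) -> dict:
    """Split one flow log line into a field dict, converting numeric fields."""
    record = dict(zip(FIELDS, line.split()))
    for key in ("srcport", "dstport", "protocol", "packets", "bytes", "start", "end"):
        # Records with NODATA/SKIPDATA status log "-" for these; leave those as-is.
        if record[key] != "-":
            record[key] = int(record[key])
    return record

# Example: an inbound SSH attempt that a security group or NACL rejected.
rec = parse_flow_record(
    "2 123456789012 eni-0a1b2c3d 203.0.113.12 10.0.1.5 "
    "49152 22 6 4 240 1620000000 1620000060 REJECT OK"
)
```

Filtering parsed records on `action == "REJECT"` is usually the fastest first pass when two instances cannot communicate.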
Cost trap: Flow logs generate a lot of data. A busy VPC with 100 instances can produce 50 to 100 GB of flow logs per day. At CloudWatch Logs pricing ($0.50 per GB ingested), that is $25 to $50 per day or $750 to $1,500 per month. Publishing to S3 is much cheaper ($0.023 per GB stored) but adds query latency when you need to investigate an issue. My recommendation: send flow logs to S3 for retention and set up Athena queries for investigation. Only send a filtered subset to CloudWatch for real-time alerting.
Key limitation: VPC flow logs do not capture traffic to or from link-local addresses (169.254.x.x), DHCP traffic, traffic to the VPC DNS server, or traffic to the instance metadata service. If your debugging involves any of these, flow logs will not help.
Azure NSG Flow Logs
Azure flow logs attach to Network Security Groups, not to VNets directly. This means you get flow data at the NSG level, which is both more granular (you see which NSG rule matched) and more limited (you need flow logs on every NSG to get full coverage). NSG flow logs publish to Azure Storage accounts and can be analyzed with Traffic Analytics in Azure Monitor.
Traffic Analytics is Azure's killer feature for network troubleshooting. It processes raw flow logs into a topology view showing traffic patterns between subnets, VNets, regions, and the internet. You can see the top talkers, blocked flows, and bandwidth utilization without writing a single query. The processing delay is 10 to 60 minutes depending on the analytics interval, so it is not real-time, but for post-incident investigation it is invaluable.
GCP VPC Flow Logs
GCP flow logs attach to subnets and capture traffic on all VMs in that subnet. They publish to Cloud Logging and can be exported to BigQuery for analysis. GCP flow logs are unique in that they include the RTT (round-trip time) for TCP connections, which makes them useful for latency debugging as well as connectivity debugging.
Sampling: GCP flow logs sample traffic rather than capturing every flow. The default sample rate is 50 percent, which is sufficient for troubleshooting most issues but can miss intermittent problems. You can increase the sample rate to 100 percent for specific subnets during active debugging, but this increases logging costs.
Reachability and Connectivity Testing Tools
Flow logs tell you what happened. Reachability tools tell you what should happen based on your current configuration. They analyze your security groups, NACLs, route tables, and peering configurations to determine whether traffic can theoretically flow between two endpoints.
AWS VPC Reachability Analyzer
Reachability Analyzer is the single most useful network debugging tool on AWS. You specify a source and destination (instances, ENIs, internet gateways, VPN gateways, Transit Gateway attachments, or VPC endpoints) and it traces the path between them, checking every hop for rules that would block traffic.
What it checks: Security group rules, NACL rules, route table entries, VPC peering configuration, Transit Gateway route tables, and prefix list membership. If traffic is blocked, it tells you exactly which component blocked it and which rule was responsible.
What it does not check: Host-level firewalls (iptables, Windows Firewall), application-level issues (the service is not listening on that port), DNS resolution, and TLS handshake issues. Reachability Analyzer says "the network would allow this traffic" -- it does not say "the application will accept this connection."
Pricing: $0.10 per analysis. This adds up if you are running analyses programmatically, but for manual debugging it is negligible.
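When you run analyses programmatically, the useful part of the response is the explanation of what blocked the path. A hypothetical helper for summarizing a result is sketched below; the dict shape loosely mirrors the NetworkInsightsAnalysis output (`NetworkPathFound` plus an `Explanations` list), but treat the exact keys and the sample explanation code as assumptions and verify against the current EC2 API reference:

```python
# Hypothetical summarizer for a Reachability Analyzer result dict.
# Key names and the sample ExplanationCode are assumptions -- check the
# EC2 NetworkInsightsAnalysis API reference for the real shapes.

def summarize_analysis(analysis: dict) -> str:
    if analysis.get("NetworkPathFound"):
        return "reachable"
    blockers = []
    for exp in analysis.get("Explanations", []):
        code = exp.get("ExplanationCode", "unknown")
        sg = exp.get("SecurityGroup", {}).get("Id")
        blockers.append(f"{code} ({sg})" if sg else code)
    return "blocked: " + "; ".join(blockers) if blockers else "blocked: no explanation"

sample = {
    "NetworkPathFound": False,
    "Explanations": [
        {"ExplanationCode": "ENI_SG_RULES_MISMATCH",
         "SecurityGroup": {"Id": "sg-0a1b2c3d4e5f"}},
    ],
}
```

Feeding results through a summarizer like this makes it easy to post one-line verdicts into an incident channel instead of raw JSON.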
Azure Network Watcher
Network Watcher is Azure's network diagnostic suite. It includes several tools:
- IP flow verify: Checks whether a packet is allowed or denied to or from a VM. It tells you which NSG rule matched. This is the Azure equivalent of Reachability Analyzer but limited to single-hop checks.
- Next hop: Shows the next hop type and IP for traffic from a VM to a destination. Essential for debugging routing issues -- "why is my traffic going through the NVA instead of the VNet gateway?"
- Connection troubleshoot: Attempts an actual TCP connection from a VM to a destination and reports the result, latency, and any hops along the way. This is more thorough than IP flow verify because it tests actual connectivity, not just configuration.
- Packet capture: Captures packets on a VM's network interface. This is the closest thing to tcpdump in the cloud. You can filter by protocol, source, destination, and port, and download the capture as a PCAP file for analysis in Wireshark.
GCP Connectivity Tests
Connectivity Tests analyze your VPC configuration to determine whether traffic can flow between two endpoints. Like Reachability Analyzer, they check firewall rules, routes, and VPC peering configuration. They also check Cloud NAT configuration and VPC Service Controls perimeters, which are common sources of connectivity issues on GCP.
Unique feature: Connectivity Tests can test reachability to Google APIs through Private Google Access and Private Service Connect, which helps debug "why can my VM not reach the Cloud Storage API through the private endpoint?"
Debugging Scenario 1: Instances Cannot Talk to Each Other
This is the most common cloud networking issue. Two instances that should be able to communicate cannot. Here is the systematic debugging approach.
Step 1: Verify Basic Connectivity Requirements
- Are the instances in the same VPC? If not, is there a peering connection, Transit Gateway attachment, or VPN between their VPCs?
- Are they in the same subnet? If not, do the route tables for both subnets have routes that allow traffic between them?
- Is the target instance running, and does it have a private IP assigned?
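The first two checks above come down to CIDR membership, which the standard `ipaddress` module handles directly. A small sketch (subnet names and CIDRs are illustrative):

```python
import ipaddress

def describe_relationship(ip_a: str, ip_b: str, subnets: dict) -> str:
    """Classify two IPs against a {name: cidr} map of the VPC's subnets,
    so you can tell at a glance whether the pair is same-subnet,
    cross-subnet, or outside the VPC entirely."""
    def find(ip):
        addr = ipaddress.ip_address(ip)
        for name, cidr in subnets.items():
            if addr in ipaddress.ip_network(cidr):
                return name
        return None

    a, b = find(ip_a), find(ip_b)
    if a is None or b is None:
        return "at least one IP is outside the known VPC ranges"
    return "same subnet" if a == b else f"cross-subnet: {a} -> {b}"

# Illustrative subnet map:
subnets = {"app-a": "10.0.1.0/24", "db-a": "10.0.2.0/24"}
```

Cross-subnet pairs send you to the route-table checks in Step 3; an IP outside the VPC ranges sends you to peering, Transit Gateway, or VPN configuration.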
Step 2: Check Security Groups and Firewalls
Security groups are stateful -- if outbound traffic is allowed, the return traffic is automatically allowed. Statefulness varies by layer and cloud: NACLs (AWS) are stateless, while NSGs (Azure) and VPC firewall rules (GCP) are stateful like security groups.
The #1 cause of connectivity issues: The security group on the destination instance does not allow inbound traffic from the source on the required port. Verify that the destination's security group allows inbound traffic from the source's security group ID (not just the source IP -- security group references are more maintainable and survive instance replacement).
On AWS, also check NACLs. NACLs are stateless, which means you need both an inbound rule on the destination subnet and an outbound rule on the source subnet. NACLs also require ephemeral port rules for return traffic (typically ports 1024 to 65535).
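The stateless behavior described above can be modeled in a few lines: rules are evaluated in ascending rule-number order, the first match wins, anything unmatched hits the implicit deny, and a round trip must pass in both directions. A minimal sketch:

```python
# Minimal model of stateless NACL evaluation: rules fire in ascending
# rule-number order, first match wins, unmatched traffic is denied.

def nacl_allows(rules, port):
    """rules: list of (rule_number, port_from, port_to, action) tuples."""
    for num, lo, hi, action in sorted(rules):
        if lo <= port <= hi:
            return action == "allow"
    return False  # implicit deny (the '*' rule)

# Destination subnet's NACL for an HTTPS request:
inbound = [(100, 443, 443, "allow")]        # request arrives on 443
outbound = [(100, 1024, 65535, "allow")]    # reply leaves on an ephemeral port

# Statelessness means BOTH directions must pass explicitly:
request_ok = nacl_allows(inbound, 443)
reply_ok = nacl_allows(outbound, 49152)     # a typical ephemeral source port
```

Forgetting the outbound ephemeral-port rule is the classic failure: the request arrives, the reply is silently dropped, and the client sees a timeout.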
Step 3: Check Route Tables
Verify that both subnets have route table entries for each other's CIDR ranges. In a simple VPC, the local route (which covers the entire VPC CIDR) handles this automatically. But if you have custom routing (through a firewall appliance, NAT gateway, or VPC endpoint), routes might be missing or pointing to the wrong target.
A common issue: a route table has a more specific route that captures traffic intended for another subnet. For example, a route for 10.0.0.0/16 pointing to a firewall appliance will capture traffic destined for 10.0.2.0/24 even if there is a local route for 10.0.0.0/8. More specific routes always win.
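Longest-prefix matching is easy to reproduce with `ipaddress`, which makes it a handy way to sanity-check a route table dump. A sketch using the example above (route targets are illustrative):

```python
import ipaddress

def select_route(routes, destination):
    """Pick the most specific matching route (longest prefix wins),
    mirroring how VPC route tables resolve overlapping entries."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes.items()
        if dest in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix = highest prefix length.
    return max(matches, key=lambda m: m[0].prefixlen)[1]

routes = {
    "10.0.0.0/8": "local",                 # broad local route
    "10.0.0.0/16": "firewall-appliance",   # more specific, so it wins
}
```

Running your actual route entries through a check like this quickly answers "which route is this packet really taking?"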
Step 4: Use Platform-Specific Tools
Run AWS Reachability Analyzer, Azure Connection Troubleshoot, or GCP Connectivity Tests between the two instances. These tools will identify the exact component blocking traffic.
Step 5: Check the Application Layer
If the network tools say traffic should flow but the application still cannot connect, the issue is above the network layer. SSH into the target instance and verify that the service is listening on the expected port (ss -tlnp on Linux, netstat -an on Windows). Check host-level firewalls (iptables, ufw, Windows Firewall). Verify that the application is binding to 0.0.0.0 or the instance's private IP, not just 127.0.0.1.
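A quick TCP probe from the source instance confirms whether anything is accepting connections on the target port, independent of ping or application behavior. A minimal sketch:

```python
import socket

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt a TCP connection; True means something accepted it.
    A False here with permissive network rules usually points at the
    service not listening, binding to 127.0.0.1, or a host firewall."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it from the source instance against the target's private IP and compare with `ss -tlnp` output on the target: if `ss` shows the port listening on `127.0.0.1` only, the bind address is the problem, not the network.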
DNS is always a suspect
Before diving into network debugging, verify that DNS resolution is working correctly. Run nslookup or dig from the source instance to confirm that the destination hostname resolves to the expected IP address. Many "connectivity issues" are actually DNS issues -- the hostname is resolving to the wrong IP, an old IP, or not resolving at all. Private DNS zones, split-horizon DNS, and VPC DNS settings are common culprits.
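The same check can be scripted so it runs identically on every instance, regardless of whether `dig` or `nslookup` is installed. A minimal sketch using the local resolver:

```python
import socket

def resolve_all(hostname: str) -> set:
    """Return every address the local resolver yields for hostname.
    Comparing this set from the source instance against the expected
    private IP quickly exposes split-horizon or stale-record issues."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}
```

Run it on the source instance and compare against the target's actual private IP: an empty set means resolution is failing outright, while an unexpected address means the wrong zone or a stale record is answering.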
Debugging Scenario 2: Intermittent Packet Loss
Intermittent issues are the hardest to debug because they are difficult to reproduce and the symptoms are often vague -- "the application is slow sometimes" or "requests occasionally time out."
Identifying the Pattern
The first step is determining whether the packet loss is time-based, load-based, or random.
- Time-based: If packet loss happens at the same time every day, look for scheduled jobs (batch processing, backups, log rotation) that saturate network bandwidth or CPU during those windows.
- Load-based: If packet loss correlates with traffic volume, you are hitting a bandwidth or throughput limit. Check instance network performance (AWS instance types have documented baseline and burst bandwidth), NAT gateway throughput limits (45 Gbps per gateway on AWS, but per-flow limits are much lower), and load balancer connection limits.
- Random: Truly random packet loss is rare in cloud environments. Check for microbursts that exceed the instance's network credit balance (applicable to burstable instance types), health check failures causing load balancer to remove healthy instances temporarily, and DNS resolution failures.
AWS-Specific Causes
- Network credit exhaustion on burstable instances: t3 and t4g instances have a baseline network bandwidth and earn credits when idle. Under sustained load, credits deplete and network performance drops to the baseline. Monitor the NetworkBandwidthInAllowanceExceeded and NetworkBandwidthOutAllowanceExceeded CloudWatch metrics.
- NAT Gateway throttling: Each NAT Gateway supports up to 55,000 simultaneous connections to a single destination IP. If your application opens many connections to a single endpoint (like a database), you can hit this limit. The symptom is intermittent connection timeouts and retries.
- Security group connection tracking limits: Each instance can track only a finite number of connections, and the allowance varies by instance type. High-throughput applications (load balancers, proxies) can exceed this limit; the conntrack_allowance_exceeded ENA metric shows when tracked-connection drops are occurring.
Azure-Specific Causes
- SNAT port exhaustion on load balancers: Azure Load Balancer allocates a fixed number of SNAT ports per backend instance (default 1,024 per VM per frontend IP). When all SNAT ports are in use, new outbound connections fail. This is the single most common cause of intermittent connectivity issues in Azure. The fix is to use NAT Gateway instead of load balancer SNAT, or increase the allocated ports.
- Accelerated Networking misconfiguration: If Accelerated Networking is enabled on the VM SKU but the OS driver is not properly loaded, network performance can be degraded. Verify with ethtool -S eth0 and check for high numbers in the tx_dropped or rx_dropped counters.
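The SNAT exhaustion math is worth doing up front. A back-of-envelope sketch, assuming the default 1,024-port allocation mentioned above and that each concurrent outbound flow to a single destination IP:port consumes one SNAT port:

```python
# Back-of-envelope SNAT capacity check for Azure Load Balancer outbound
# flows. Assumes the default 1,024 ports per VM per frontend IP and that
# each concurrent flow to one destination IP:port consumes one SNAT port.

def max_concurrent_flows(ports_per_vm: int, frontend_ips: int) -> int:
    """Upper bound on concurrent outbound flows per VM to a single
    destination IP:port tuple."""
    return ports_per_vm * frontend_ips

# One frontend IP with the default allocation:
default_cap = max_concurrent_flows(1024, 1)
# Adding a second frontend IP doubles the SNAT port pool:
doubled_cap = max_concurrent_flows(1024, 2)
```

If your expected concurrent outbound flows to a single backend dependency approach this bound, move to NAT Gateway or raise the allocation before the intermittent failures start.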
GCP-Specific Causes
- Per-VM egress bandwidth limits: GCP VMs have per-VM egress bandwidth limits that vary by machine type. An e2-standard-2 has a maximum egress of 4 Gbps. If your application exceeds this, packets are dropped silently.
- Firewall rule logging: Enable firewall rule logging temporarily on rules that should allow traffic. If you see DENY log entries for traffic that should be allowed, you have a rule ordering issue (a higher-priority deny rule is matching before your allow rule).
Packet mirroring is the last resort
When flow logs and reachability tools are not enough, all three clouds offer packet mirroring (AWS Traffic Mirroring, Azure Packet Capture, GCP Packet Mirroring). This captures actual packet contents and lets you analyze them in Wireshark. Use this for diagnosing TLS handshake failures, application protocol issues, and packet corruption. But be aware of the cost -- mirrored traffic is billed at standard data transfer rates, and capturing all traffic on a busy instance generates gigabytes of data per hour.
Debugging Scenario 3: Cannot Reach External Services
Your instances cannot reach the internet, a SaaS API, or a cloud provider service endpoint.
Private Subnets Need Explicit Egress
Instances in private subnets (no internet gateway in the route table) cannot reach the internet unless you provide an egress path. The options are:
- NAT Gateway/NAT Instance: Routes outbound traffic through a public subnet. The most common approach for general internet access.
- VPC Endpoints / Private Endpoints / Private Service Connect: Provides private connectivity to specific cloud services (S3, DynamoDB, etc.) without going through a NAT gateway. This is cheaper and more secure for cloud service traffic.
- Proxy server: Routes traffic through an HTTP/HTTPS proxy. Useful when you need to control and log which external services instances can reach.
The debugging checklist for external connectivity: (1) Does the route table have a route for 0.0.0.0/0 pointing to a NAT gateway, internet gateway, or proxy? (2) Does the security group allow outbound traffic on the required port? (3) Does the NACL allow outbound traffic and inbound return traffic on ephemeral ports? (4) If using a NAT gateway, is the NAT gateway in a public subnet with an Elastic IP? (5) If using VPC endpoints, is the endpoint in the same AZ as your instance, and does its policy allow the requests?
Building a Network Debugging Toolkit
These are the tools and practices I set up in every cloud environment before issues arise:
- Enable VPC flow logs on every VPC. Publish to S3 with a 14-day lifecycle policy. The cost is manageable and the debugging value is immense.
- Set up Athena/BigQuery queries for flow log analysis. Pre-write queries for "show me all REJECT flows for instance X," "show me all traffic to destination port 443 from subnet Y," and "show me traffic patterns between VPC A and VPC B."
- Deploy a bastion or Systems Manager. You need a way to SSH into instances and run diagnostic commands (ping, traceroute, dig, curl, ss). AWS Systems Manager Session Manager, Azure Bastion, and GCP IAP tunneling all provide this without exposing SSH ports.
- Document your network architecture. A diagram showing VPCs, subnets, peering connections, transit gateways, and route table summaries saves hours during debugging. Keep it updated when you make changes -- an outdated diagram is worse than no diagram.
- Tag network resources. When Reachability Analyzer tells you that traffic is blocked by "sg-0a1b2c3d4e5f," you need to know what that security group is for. Consistent tagging with Name, Service, and Environment tags makes debugging dramatically faster.
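The pre-written queries from the toolkit list can live as strings in a runbook script so they are ready to fire during an incident. A sketch, assuming an Athena table named vpc_flow_logs created over the default-format S3 export with columns srcaddr, dstaddr, dstport, action, and bytes (adjust names to your actual schema):

```python
# Pre-written Athena queries for flow log triage, kept as templates.
# The table name vpc_flow_logs and its column names are assumptions --
# they must match whatever schema your CREATE TABLE statement defined.

QUERIES = {
    "rejects_for_instance": """
        SELECT srcaddr, dstaddr, dstport, count(*) AS hits
        FROM vpc_flow_logs
        WHERE action = 'REJECT'
          AND (srcaddr = '{ip}' OR dstaddr = '{ip}')
        GROUP BY srcaddr, dstaddr, dstport
        ORDER BY hits DESC
    """,
    "traffic_to_port": """
        SELECT srcaddr, sum(bytes) AS total_bytes
        FROM vpc_flow_logs
        WHERE dstport = {port}
        GROUP BY srcaddr
        ORDER BY total_bytes DESC
    """,
}

# Fill in the instance under investigation:
rejects_sql = QUERIES["rejects_for_instance"].format(ip="10.0.1.5")
```

Keeping these as templates means the midnight incident starts with pasting a query, not writing one.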
Cloud networking issues will always be part of operating in the cloud. The difference between a 10-minute resolution and a 4-hour outage is whether you have the tools enabled, the queries pre-written, and the knowledge to interpret what they tell you. Set up your debugging toolkit before you need it, because the worst time to learn how flow logs work is during a production incident at midnight.
Written by CloudToolStack Team
Cloud architects with 15+ years of production experience across AWS, Azure, GCP, and OCI. We build free tools and write practical guides to help engineers navigate multi-cloud infrastructure.
Disclaimer: This article is for informational purposes. Cloud services and pricing change frequently; always verify with official provider documentation. AWS, Azure, GCP, and OCI are trademarks of their respective owners.