TCP Connection Intermittent Failures

Problem Statement:

Some TCP connections from instances in a private subnet to a specific destination through a NAT gateway succeed, but others fail or time out.



The cause of this problem might be one of the following:

The destination endpoint is responding with fragmented TCP packets. NAT gateways do not support IP fragmentation for TCP or ICMP.

The tcp_tw_recycle option is enabled on the remote server, which is known to cause issues when there are multiple connections from behind a NAT device.


What is tcp_tw_recycle?

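The tcp_tw_recycle option is a Boolean setting that enables fast recycling of TIME_WAIT sockets. The default value is 0. When enabled, the kernel becomes more aggressive and makes assumptions about the timestamps used by remote hosts. It tracks the last timestamp used by each remote host and allows the reuse of a socket if the timestamp has increased. Hosts behind a NAT device share one source IP address but keep independent timestamp clocks, so the remote server can see timestamps that appear to go backwards and silently drop those connection attempts.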



Verify whether the endpoint to which you’re trying to connect is responding with fragmented TCP packets by doing the following:

1. Use an instance in a public subnet with a public IP address to trigger a response large enough to cause fragmentation from the specific endpoint.


2. Use the tcpdump utility to verify that the endpoint is sending fragmented packets.


You must use an instance in a public subnet to perform these checks. You cannot use the instance from which the original connection was failing, or an instance in a private subnet behind a NAT gateway or a NAT instance.

Diagnostic tools that send or receive large ICMP packets will report packet loss. For example, the command ping -s 10000 does not work behind a NAT gateway.


3. If the endpoint is sending fragmented TCP packets, you can use a NAT instance instead of a NAT gateway.
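As a rough sketch of step 2, the helper below builds a tcpdump capture filter that matches only fragmented IPv4 packets to or from one endpoint. The function name frag_filter, the address 203.0.113.10, and the interface eth0 are placeholders chosen for illustration; substitute your own values.

```shell
#!/usr/bin/env bash
# Sketch: build a tcpdump filter that matches only fragmented IPv4 packets
# exchanged with one endpoint. The function name is hypothetical.

# Print a capture filter for IPv4 fragments to or from the given host.
# ip[6:2] holds the IP flags and fragment-offset fields. Masking with 0x3fff
# keeps the More Fragments bit and the 13-bit offset, so a non-zero result
# means the packet is a fragment (first, middle, or last).
frag_filter() {
  local endpoint="$1"
  printf 'host %s and (ip[6:2] & 0x3fff != 0)' "$endpoint"
}

# Example invocation (requires root; run on the public-subnet instance;
# the address and interface are placeholders):
#   sudo tcpdump -n -i eth0 "$(frag_filter 203.0.113.10)"
```

If tcpdump prints packets while you trigger the large response, the endpoint is fragmenting its replies. No output during the test suggests that fragmentation is not the cause.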


If you have access to the remote server, you can verify whether the tcp_tw_recycle option is enabled by doing the following:

1. From the server, run the following command.

cat /proc/sys/net/ipv4/tcp_tw_recycle

If the output is 1, then the tcp_tw_recycle option is enabled.

2. If tcp_tw_recycle is enabled, we recommend disabling it. If you need to reuse connections, tcp_tw_reuse is a safer option.
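The two steps above can be sketched as a small shell helper. The function name tw_recycle_status and its optional path argument are assumptions made for illustration (the argument exists only so the function can be tried against a test file). The option was removed in Linux 4.12, so on newer kernels the file is simply missing and this cause can be ruled out.

```shell
#!/usr/bin/env bash
# Sketch: report the tcp_tw_recycle state by reading its sysctl file.
# The function name and the optional path argument are hypothetical.

tw_recycle_status() {
  local path="${1:-/proc/sys/net/ipv4/tcp_tw_recycle}"
  if [ ! -r "$path" ]; then
    # The option was removed in Linux 4.12, so newer kernels have no such file.
    echo "absent"
  elif [ "$(cat "$path")" = "0" ]; then
    echo "disabled"
  else
    # Any non-zero value is treated as enabled in this sketch.
    echo "enabled"
  fi
}

# To turn the option off on the remote server until the next reboot
# (requires root):
#   sudo sysctl -w net.ipv4.tcp_tw_recycle=0
```

Calling tw_recycle_status with no argument on the remote server prints enabled, disabled, or absent.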

If you don’t have access to the remote server, you can test by temporarily disabling the tcp_timestamps option on an instance in the private subnet. Then connect to the remote server again. If the connection is successful, the previous failure was likely caused by tcp_tw_recycle being enabled on the remote server. If possible, contact the owner of the remote server to verify whether this option is enabled and request that it be disabled.
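One possible way to run that test is sketched below: the helper records the current tcp_timestamps value, sets it to 0, runs a connection test of your choice, and then restores the original value. The function name with_timestamps_disabled, the SYSCTL variable, and the host and port in the example are assumptions for illustration. The change made by sysctl -w is not persistent across reboots.

```shell
#!/usr/bin/env bash
# Sketch: run one connection test with TCP timestamps turned off, then
# restore the previous setting. SYSCTL defaults to the real tool; it is a
# variable only so that the flow can be dry-run with a stand-in command.

SYSCTL="${SYSCTL:-sysctl}"

with_timestamps_disabled() {
  local previous status
  # Remember the current value so it can be restored afterwards.
  previous="$("$SYSCTL" -n net.ipv4.tcp_timestamps)" || return 1
  "$SYSCTL" -w net.ipv4.tcp_timestamps=0 >/dev/null || return 1
  # Run the caller's connection test and keep its exit status.
  "$@"
  status=$?
  "$SYSCTL" -w "net.ipv4.tcp_timestamps=${previous}" >/dev/null
  return "$status"
}

# Example (run as root on the private-subnet instance; the host and port
# are placeholders):
#   with_timestamps_disabled nc -z -w 5 remote.example.com 443
```

Because the failures are intermittent, repeat the test several times before drawing a conclusion from a single successful connection.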
