XDR, client or cluster intermittent network timeouts caused by MTU and ICMP issue

The Aerospike Knowledge Base has moved to https://support.aerospike.com. Content on https://discuss.aerospike.com is being migrated to either https://support.aerospike.com or https://docs.aerospike.com. Maintenance on articles stored in this repository ceased on December 31st 2022 and this article may be stale. If you have any questions, please do not hesitate to raise a case via https://support.aerospike.com.

Problem Description

In an otherwise healthy cluster or XDR setup, intermittent timeouts may occur. The issue may also only appear after the write-block-size has been increased.

Explanation

The standard MTU size is 1500 bytes, and with this MTU everything should normally work properly. If the MTU has been modified anywhere along the path between nodes in the cluster or between XDR sites, an MTU mismatch can occur. The kernel normally handles an MTU mismatch as follows:

  1. The source sends a packet that is too big for either the destination or one of the hops along the way.
  2. The packet reaches a server/router which has a smaller MTU configured. At this point one of two things will happen:

If the DF flag (Don't Fragment) is NOT set, the device will perform fragmentation. The packet is chopped into smaller fragments and sent onwards. This results in increased latency and network use: not only does the router need to chop up the packet, it also now needs to send multiple frames carrying the fragments. Each packet has a 28-byte header, so a 3000-byte payload that would normally be a single 3000+28=3028-byte transfer becomes, once fragmented to fit a 1500-byte MTU, 3 packets carrying roughly 3000+(3*28)=3084 bytes. Apart from this overhead, everything should function as normal.

If the DF flag IS set, it tells all routers along the route not to attempt packet fragmentation. This is a much more elegant way of identifying the optimal MTU, known as Path MTU Discovery. Under certain conditions, the Linux kernel may turn on the DF flag regardless of what the application chose for the first packet. In this case, once the packet reaches a router whose MTU is smaller than the packet size, the router drops the packet. At the same time, the router responds to the source with a "too big" message ("fragmentation needed"), indicating that the packet is too large to fit and needs to be reduced in size. The "too big" response does not come back in the same TCP/UDP channel as the packet. Instead it is sent using the control protocol, ICMP - the same protocol used for echo (ping), just with a different message type and code; ICMP has no notion of ports. Once the source receives the ICMP "too big" message, it decreases the packet size and resends. That way, the optimal packet size can be reached. This approach also incurs some extra network use, but it avoids the latency of a router having to fragment packets and the destination having to reassemble them. Setting a correct, consistent MTU along the whole route remains the optimal solution.
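As a side note, whether the Linux kernel performs Path MTU Discovery (and therefore sets the DF flag on outgoing packets) can be checked with a sysctl. This is only a quick sanity check, not a required step:

# 0 (the default) means Path MTU Discovery is enabled
$ sysctl net.ipv4.ip_no_pmtu_disc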

If ICMP is blocked, the issue may not present itself immediately. The "too big" control messages will no longer be delivered, so packets larger than the smallest MTU along the path are dropped without any notification, while packets small enough to fit in the frame still go through. The result is intermittently working cluster or XDR traffic with frequent timeouts.
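To see this mechanism from the command line, tracepath probes the path MTU hop by hop and relies on those ICMP messages, while tcpdump can confirm whether the "fragmentation needed" ICMP packets are actually arriving. This is an optional sketch; the remote IP and interface name are placeholders matching the examples further down:

# Probe the path MTU towards a remote node (depends on ICMP responses getting through)
$ tracepath -n 10.0.2.2

# Watch for incoming ICMP "destination unreachable" messages (which include "fragmentation needed")
$ tcpdump -ni enp0s3 'icmp and icmp[icmptype] == icmp-unreach'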

Solution

First, the ping command can be used to identify whether there is an MTU issue. It also allows you to verify the second issue - whether ICMP is being blocked.

Send a small packet first to identify whether ICMP is blocked entirely. The ping should be performed in both directions, from source to destination and back. For example, on a 2-node cluster replicating via XDR to another 2-node cluster (with an XDR issue), the ping should be performed from each node to the other 3 nodes.

$ ping REMOTE_IP

$ ping 10.0.2.2
PING 10.0.2.2 (10.0.2.2) 56(84) bytes of data.
64 bytes from 10.0.2.2: icmp_seq=1 ttl=64 time=0.199 ms
64 bytes from 10.0.2.2: icmp_seq=2 ttl=64 time=0.234 ms
64 bytes from 10.0.2.2: icmp_seq=3 ttl=64 time=0.234 ms
64 bytes from 10.0.2.2: icmp_seq=4 ttl=64 time=0.263 ms
^C

This indicates that the ping has gone through. Now, identify the MTU on the machine from which the pings are being sent.

$ ip link ls
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:a1:18:eb brd ff:ff:ff:ff:ff:ff

From this we can see that the MTU for the main enp0s3 interface in use is 1500. Now send a 1500-byte ping to all the remote destinations, as before.

$ ping -M do -s 1472 10.0.2.2
PING 10.0.2.2 (10.0.2.2) 1472(1500) bytes of data.
1480 bytes from 10.0.2.2: icmp_seq=1 ttl=64 time=0.244 ms
1480 bytes from 10.0.2.2: icmp_seq=2 ttl=64 time=0.203 ms
^C

Note that the packet size sent is 1472 instead of 1500. This actually puts a 1500-byte frame on the wire, due to the mentioned 28 bytes of IP and ICMP headers; always subtract 28 from the MTU when testing with ping. If at this stage you get an error such as "ping: sendto: Message too long", the packet is too large for the MTU, but you are being told about it - fragmentation or the ICMP "too big" notification is occurring as it should - which indicates that the problem described in this article is NOT the cause. If, on the other hand, the large pings silently time out while the small pings succeed, the ICMP "too big" messages are most likely being blocked along the path, which is exactly the problem described here.
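To narrow down the largest packet that actually makes it through, a quick sweep of payload sizes can help. This is only an illustrative sketch - adjust the sizes and replace REMOTE_IP with the address of the node being tested:

# Try a range of payload sizes with the DF flag set; one probe per size, 1-second timeout
$ for size in 1472 2000 4000 8972; do ping -M do -c 1 -W 1 -s $size REMOTE_IP > /dev/null 2>&1 && echo "$size bytes: OK" || echo "$size bytes: FAILED"; done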

If this is the cause of the issue, first allow ICMP control messages through from all sides (at a minimum the "destination unreachable / fragmentation needed" messages); ICMP is a protocol of its own, not a port, and should be permitted for normal functioning of the system. Once done, identify whether you are in control of the hops in the route that have the MTU misconfiguration and consider adjusting their MTU to match the rest of the route.
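As an illustration only, on a Linux host filtering with iptables, rules along the following lines would let the relevant ICMP messages through, and ip link can be used to align an interface's MTU. The interface name enp0s3 is taken from the example above; your firewall layout may well differ:

# Allow ICMP "fragmentation needed" (type 3, code 4) so Path MTU Discovery can work
$ iptables -A INPUT -p icmp --icmp-type fragmentation-needed -j ACCEPT

# For IPv6, the equivalent message is "packet too big"
$ ip6tables -A INPUT -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT

# Align the interface MTU with the rest of the route (example: 1500)
$ ip link set dev enp0s3 mtu 1500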

Notes

To improve performance, one can increase packet sizes and enable jumbo frames. This will result in less overhead when sending packets. For more information, refer to: Check Jumbo Frame Settings with Ping
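For example, if the NICs and every hop in between support 9000-byte jumbo frames (an assumption that must be verified for your network), the interface MTU can be raised and then validated with the same ping technique, again subtracting 28 bytes for the headers:

# Raise the interface MTU to 9000 (only if the NIC and all switches/routers support it)
$ ip link set dev enp0s3 mtu 9000

# Verify end to end with a non-fragmenting ping: 9000 - 28 = 8972 bytes of payload
$ ping -M do -s 8972 REMOTE_IP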

Keywords

XDR CLIENT CLUSTER MTU ICMP PACKETS INTERMITTENT

Timestamp

5/17/17