We have been seeing some issues recently where we would get what appeared to be cluster membership changes (and then migrations, which would impact latency).
The logs would report messages like this: `May 29 2019 18:41:28 GMT: WARNING (hb): (hb.c:5138) (repeated:1) sending mesh message to 1117 on fd 736 failed : Broken pipe`
We were seeing these quite frequently (thousands of times an hour across the cluster), and sometimes (but not always) they would cause a node to disappear entirely and get evicted from the cluster for a short period of time (maybe 10-30 seconds). Inspecting the traffic with tcpdump suggested that packets were simply going missing for stretches of time, and while hardware (or cable) issues were a possibility, it didn’t seem likely they would affect all of the nodes across multiple clusters.
This was most prevalent in our larger clusters (the specific cluster we have been tuning has 20 nodes, each with 768GB of RAM, and we are using TCP heartbeats – I wish we had gone with UDP, but we are stuck with TCP for now). Strangely, it was most apparent in a cluster that has only XDR traffic, no direct IOs.
It was our theory that we were processing packets too slowly, so in an effort to tune this we changed the following settings:
- Firstly we set tuned to run the network-latency profile. This does a number of things, but by itself wasn’t enough. So…
- sysctl vm.swappiness=0 – recommended here: Tuning Kernel Memory for Performance
- 5% of RAM > /proc/sys/vm/min_free_kbytes – also from the above link
- sysctl net.ipv4.tcp_slow_start_after_idle=0
- sysctl net.ipv4.tcp_no_metrics_save=1
- sysctl net.core.netdev_max_backlog=30000

The above three settings came from various sources (note that we use 10GigE, so they may not be applicable to every setup):
- https://cromwell-intl.com/open-source/performance-tuning/tcp.html
- https://github.com/ton31337/tools/wiki/tcp_slow_start_after_idle---tcp_no_metrics_save-performance
- https://darksideclouds.wordpress.com/2016/10/10/tuning-10gb-nics-highway-to-hell/
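For reference, the steps above can be sketched as a single script. This is a sketch of what we ran, not a drop-in tool: it assumes a Linux box with tuned installed, must be run as root, and parses /proc/meminfo with awk to compute the 5% min_free_kbytes figure we settled on. Apply it to one node at a time, since the min_free_kbytes write can briefly stall the network.

```shell
#!/bin/sh
# Sketch of the full tuning pass described above (run as root, one node
# at a time -- the min_free_kbytes write can stall the NIC briefly).

# 1. Low-latency tuned profile
tuned-adm profile network-latency

# 2. Discourage the kernel from swapping out Aerospike's memory
sysctl -w vm.swappiness=0

# 3. Reserve ~5% of RAM as the kernel's free-page pool, so page
#    reclaim doesn't stall packet processing under memory pressure
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
min_free_kb=$((total_kb * 5 / 100))
echo "$min_free_kb" > /proc/sys/vm/min_free_kbytes

# 4-6. TCP settings for long-lived heartbeat connections on 10GigE
sysctl -w net.ipv4.tcp_slow_start_after_idle=0
sysctl -w net.ipv4.tcp_no_metrics_save=1
sysctl -w net.core.netdev_max_backlog=30000
```

On a 768GB node like ours, the 5% computation lands around 40GB reserved for the free-page pool, which is why the 1-3% question below is worth keeping in mind.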
It should also be noted that the third step (raising min_free_kbytes) caused a momentary loss of network on each node as we set it, so be careful doing this on live clusters. That also gives a pretty strong hint as to the performance issue we were seeing: apparently walking the page tables to flush cached memory was taking a significant amount of time and causing us to drop packets. The other settings helped, but it was the min_free_kbytes change that finally reduced the broken connections down to nothing.
5% might be a little on the high side, and possibly 1-3% would have been enough, but due to the impact this has, I’m reluctant to experiment too much.
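One caveat worth adding: plain sysctl writes don't survive a reboot. A sketch of persisting the static values (the drop-in file name is made up, and min_free_kbytes is deliberately left out because it has to be computed from each node's RAM):

```shell
# Hypothetical drop-in file; any /etc/sysctl.d/*.conf name works.
cat > /etc/sysctl.d/99-heartbeat-tuning.conf <<'EOF'
vm.swappiness = 0
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 30000
EOF
sysctl --system   # reload all sysctl drop-in files now
```

The tuned profile itself persists on its own once set with tuned-adm, so it doesn't need an entry here.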
Suffice it to say, this is going to be part of our standard tuning for AS nodes now, and hopefully this helps someone else.